Introducing Probability and Statistics:
A concise course on the fundamentals of statistics
by Dr Robert G Aykroyd
Department of Statistics, University of Leeds
© R G Aykroyd and University of Leeds, 2014. Section 4 produced in collaboration with S Barber.
Introducing Probability and Statistics
“Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the
natural and social sciences to the humanities, government and business.
Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. In addition, patterns in the data may be modelled in a way that accounts for
randomness and uncertainty in the observations, and are then used to draw inferences about the process or population being studied; this is called inferential statistics. Both descriptive and inferential
statistics comprise applied statistics. There is also a discipline called mathematical statistics, which
is concerned with the theoretical basis of the subject.”
(Source: http://en.wikipedia.org/wiki/Statistics)
This short course aims to give a quick reminder of many basic ideas in probability and statistics. The
material is selected from an undergraduate module on mathematical statistics, and hence emphasises
the “theoretical basis” which underpins applied statistics. The topics covered are a mix of practical
methods and mathematical foundations. If you are familiar with most of these ideas, then you are
well prepared for your studies. If, on the other hand, you find some of the topics new then please take
some extra time to understand the ideas and complete the exercises.
Outline of the course:
1. BASIC PROBABILITY. Events, sample space and the axioms. Random variables. Expectation and variance.
2. CONDITIONAL PROBABILITY. Conditional probability and independence. Expectation and variance. Total probability and Bayes Theorem.
3. STANDARD DISTRIBUTIONS. Binomial, Poisson, exponential and normal. Moment generating functions. Sampling distributions.
4. LINEAR REGRESSION. The linear regression model. Vector form of regression.
5. CLASSICAL ESTIMATION. Method of Moments. Maximum likelihood. Properties of estimators. Hypothesis testing. Likelihood ratio test. Exercises.
6. THE NORMAL DISTRIBUTION. Transformations to normality. Approximations and the central limit theorem.
7. DERIVED DISTRIBUTIONS. Functions of random variables. Sums of independent variables. Student-t, Chi-squared and F distributions. Exercises.
8. BAYESIAN ESTIMATION. Subjective probability and expert opinion. Definitions of prior, likelihood and posterior. Posterior estimation. Exercises.
Practical Exercises.
Solutions to Practical Exercises.
Solutions to Theoretical Exercises.
Standard Distributions and Tables.

Useful references:
Rice JA, Mathematical Statistics and Data Analysis, Duxbury Press, 2nd Ed, 1995.
Stirzaker DR, Elementary Probability, CUP, 2003 (online at University library).
1 Basic Probability

1.1 Introduction
Probability is a branch of mathematics which rigorously describes uncertain (or random) systems and
processes. It has its roots in the 16th/17th century with the work of Cardano, Fermat and Pascal, but
it is also an area of modern development and research. Put simply, probability measures the likelihood or chance of some event occurring: probability zero means the event is impossible whereas a
probability of 1 means that the event is certain. The larger the probability, the more likely the event.
Applications include: modelling hereditary disease in genetics, pension calculations in actuarial science, stock pricing in finance, epidemic modelling in public health, and many more!
1.2 Events and axioms
The set of all possible outcomes is the sample space Ω (the Greek letter, capital “omega”), and we
may be interested in the chance of some particular outcome, or event, occurring.
An event, often denoted A, B, C, · · · , is a set of outcomes of an experiment. The set can be empty,
A = ∅, giving an impossible event, or it can be equal to the sample space, A = Ω, giving a certain
event. These extremes are not very interesting and so the event will usually be a non-empty, proper
subset of the sample space.
Probabilities must satisfy the following simple rules:
The (Kolmogorov) axioms:
K1 Pr(A) ≥ 0 for any event A,
K2 Pr(Ω) = 1 for any sample space Ω,
K3 Pr(A ∪ B) = Pr(A) + Pr(B) for any mutually exclusive events A and B (that is, when A ∩ B = ∅).
Clearly, these are very basic properties but they are sufficient to allow many complex rules to be
derived, such as:
The general addition rule:

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
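As a quick numerical check of these rules, here is a small Python sketch; the fair-die example and the use of Python's standard-library fractions module are illustrative additions, not part of the original notes.

from fractions import Fraction

# Sample space for one roll of a fair die; each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}

def pr(event):
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}        # "even outcome"
B = {4, 5, 6}        # "outcome greater than 3"

# General addition rule: Pr(A u B) = Pr(A) + Pr(B) - Pr(A n B).
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)
print(pr(A | B))     # 2/3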
FURTHER READING: Sections 1.2-1.4 of Stirzaker.
1.3 Random variables
Whenever the outcome of a random experiment is a number, then the experiment can be described by
a random variable. It is conventional to use capital letters to denote random variables, e.g. X, Y, Z.
The range space of a random variable X is the set, S_X, of all possible values of the random variable, e.g. S_X = {a_1, a_2, ..., a_r, ...} or S_X = [0, ∞). A discrete random variable is a random variable with a finite (or countably infinite) range space. A continuous random variable is a random variable with an uncountable range space.
For a discrete random variable, X say, the probability of the random variable taking a particular element of the range space is Pr(X = a_r) (or p_X(x)) – this is called the probability mass function. When the random variable, Y say, is continuous we have a function, f_Y(y), to describe the density of probability over the range space – this is called the probability density function.
Alternatively, the probabilities may be summarised by a distribution function defined by

F_Z(z) = Pr(Z ≤ z).

For discrete random variables this is obtained by summing the probability mass function, and for continuous random variables by integrating the probability density function,

F_X(x) = Σ_{r ≤ x} p_X(r),        F_Y(y) = ∫_{−∞}^{y} f_Y(t) dt.

As a consequence of this last result, a probability density function can be obtained from the corresponding distribution function by differentiating:

f_Y(y) = d/dy F_Y(y).
FURTHER READING: Sections 2.1, 2.2 and 15.3.2 of Rice and Section 4.1 of Stirzaker. An interesting
discussion of randomness can be found at http://en.wikipedia.org/wiki/Randomness
1.4 Expectations and variance
The expectation (or mean) of a random variable is defined as:

E[X] = Σ_x x p(x) for discrete X,        E[X] = ∫_x x f(x) dx for continuous X,

and the moments (about zero) are defined by:

E[X^r] = Σ_x x^r p(x) for discrete X,        E[X^r] = ∫_x x^r f(x) dx for continuous X.

The expectation of a function of a random variable is given by:

E[g(X)] = Σ_x g(x) p(x) for discrete X,        E[g(X)] = ∫_x g(x) f(x) dx for continuous X.

The variance is

Var(X) = E[(X − µ)²] = Σ_x (x − µ)² p(x) (discrete),        Var(X) = ∫_x (x − µ)² f(x) dx (continuous),

where µ = E[X]. It is usually easier, however, to calculate the variance using

Var(X) = E[X²] − {E[X]}²

where E[X²] is the expectation of X-squared, i.e. the second moment about zero.
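The following Python sketch (an illustrative addition, with a made-up probability mass function) checks numerically that the two expressions for the variance agree.

# A minimal numerical check of E[X], E[X^2] and the two variance formulas
# for a small discrete distribution (the values and probabilities are made up).
xs = [0, 1, 2]
ps = [0.2, 0.5, 0.3]

mean = sum(x * p for x, p in zip(xs, ps))                    # E[X] = 1.1
second_moment = sum(x**2 * p for x, p in zip(xs, ps))        # E[X^2] = 1.7
var_direct = sum((x - mean)**2 * p for x, p in zip(xs, ps))  # E[(X - mu)^2]
var_shortcut = second_moment - mean**2                       # E[X^2] - E[X]^2

print(mean, var_direct, var_shortcut)   # both variance formulas give 0.49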
FURTHER READING: Section 4.3 of Stirzaker, and http://en.wikipedia.org/wiki/Expected_value
2 Conditional Probability

2.1 Definitions
For two discrete random variables, X and Y , we have:
JOINT PROBABILITY MASS FUNCTION:

p(x, y) = Pr(X = x, Y = y),

where (i) 0 ≤ p(x, y) ≤ 1 for all x, y, and (ii) Σ_x Σ_y p(x, y) = 1.

MARGINAL PROBABILITY MASS FUNCTIONS:

p_X(x) = Σ_y p(x, y),        p_Y(y) = Σ_x p(x, y).

CONDITIONAL PROBABILITY MASS FUNCTIONS:

p_{X|Y}(x|y) = p(x, y)/p_Y(y) where p_Y(y) > 0,        p_{Y|X}(y|x) = p(x, y)/p_X(x) where p_X(x) > 0.
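A short Python sketch (with an invented joint pmf, purely for illustration) showing how marginal and conditional probability mass functions are computed from a joint table.

# Marginal and conditional pmfs computed from a small joint pmf p(x, y).
p = {(0, 0): 0.10, (0, 1): 0.30,
     (1, 0): 0.20, (1, 1): 0.40}

xs = {x for x, _ in p}
ys = {y for _, y in p}

p_X = {x: sum(p[(x, y)] for y in ys) for x in xs}   # p_X(x) = sum_y p(x, y)
p_Y = {y: sum(p[(x, y)] for x in xs) for y in ys}   # p_Y(y) = sum_x p(x, y)

# Conditional pmf p_{X|Y}(x | y) = p(x, y) / p_Y(y), defined when p_Y(y) > 0.
p_X_given_Y1 = {x: p[(x, 1)] / p_Y[1] for x in xs}
print(p_X, p_Y, p_X_given_Y1)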
FURTHER READING: Chapter 3 and Section 4.4 of Rice, and Sections 2.1 and 2.2 of Stirzaker.
Continuous case
For two continuous random variables, X and Y , we have:
JOINT PROBABILITY DENSITY FUNCTION:

f(x, y),        −∞ < x < ∞, −∞ < y < ∞,

where (i) f(x, y) ≥ 0 for all x, y, and (ii) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

MARGINAL PROBABILITY DENSITY FUNCTIONS:

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy,        f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx.

CONDITIONAL PROBABILITY DENSITY FUNCTIONS:

f_{X|Y}(x|y) = f(x, y)/f_Y(y) where f_Y(y) > 0,        f_{Y|X}(y|x) = f(x, y)/f_X(x) where f_X(x) > 0.

2.2 Independent random variables
Two random variables X and Y are independent if and only if
p(x, y) = p_X(x) p_Y(y) for all x, y (discrete case),
f(x, y) = f_X(x) f_Y(y) for all x, y (continuous case).
FURTHER READING: Sections 5.1-5.3 and 4.4 of Stirzaker.
2.3 Expectations and correlation
Consider random variables X and Y , with joint probability density function f (x, y) or joint probability mass function p(x, y), then for any function h(x, y):
E[h(X, Y)] = Σ_x Σ_y h(x, y) p(x, y) for discrete X, Y,        E[h(X, Y)] = ∫_x ∫_y h(x, y) f(x, y) dy dx for continuous X, Y.
For example, the (r, s)th moment about zero, E[X^r Y^s], has h(x, y) = x^r y^s, so that E[XY] uses h(x, y) = xy. Further, the (r, s)th moment about the mean, given by E[(X − µ_X)^r (Y − µ_Y)^s], has h(x, y) = (x − µ_X)^r (y − µ_Y)^s where µ_X = E[X] and µ_Y = E[Y].
Then the correlation of X and Y can be found as:

Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)),
where Cov(X, Y ) = E [(X − µX )(Y − µY )] is the covariance of X and Y . If X and Y are independent, then the covariance is zero and hence the correlation is zero – whenever the correlation is zero,
then the variables are said to be uncorrelated. Note, however, that in general uncorrelated does not
mean the variables are independent.
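A small simulation sketch of that last remark (Python with numpy, which is an assumption of this illustration, not a package used in the notes): Y = X² is completely determined by X, yet when X is symmetric about zero the correlation is essentially zero.

import numpy as np

# X symmetric about zero and Y = X^2: Y is a function of X (so not
# independent of it), yet Cov(X, Y) = E[X^3] - E[X]E[X^2] = 0.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x**2

print(np.corrcoef(x, y)[0, 1])   # close to 0, despite total dependence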
Given random variables X and Y , the conditional expectation of Y given that X = x is:
E[Y | X = x] = Σ_y y p_{Y|X}(y|x) (discrete),        E[Y | X = x] = ∫_y y f_{Y|X}(y|x) dy (continuous).
Clearly, in either of these definitions the conditional distribution could be replaced by the ratio of joint
distribution to the appropriate marginal distribution.
For any function h(Y ), the conditional expectation of h(Y ) given X = x is given by
E[h(Y) | X = x] = Σ_y h(y) p_{Y|X}(y|x) (discrete),        E[h(Y) | X = x] = ∫_y h(y) f_{Y|X}(y|x) dy (continuous).
2.4 Total probability and Bayes Theorem
Suppose that we are interested in the probability of some event A, but that it is not easy to evaluate Pr(A) directly. Firstly, let the events B_1, B_2, ..., B_k partition the sample space. For B_1, B_2, ..., B_k to be a partition of the sample space Ω, they must be (i) mutually exclusive, that is B_i ∩ B_j = ∅ (for i ≠ j), and (ii) exhaustive, that is B_1 ∪ B_2 ∪ ... ∪ B_k = Ω. Further suppose that we can easily find Pr(A|B_j) (for j = 1, ..., k); then
Total probability rule:

Pr(A) = Σ_{j=1}^{k} Pr(A|B_j) Pr(B_j).
Further, suppose that we have a conditional probability, Pr(A|B) for example, but we are interested in the probability of the events conditioned the other way, that is Pr(B|A); then

Bayes theorem (1):

Pr(B|A) = Pr(A|B) Pr(B) / Pr(A),        when Pr(A) > 0.
In general, if B1 , B2 , . . . , Bk is a partition, as above, and we use the total probability rule, then we
can write
Bayes theorem (2):

Pr(B_i | A) = Pr(A|B_i) Pr(B_i) / [ Σ_{j=1}^{k} Pr(A|B_j) Pr(B_j) ],        i = 1, ..., k.

FURTHER READING: Section 2.1 of Stirzaker.
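A numerical sketch of the total probability rule and Bayes theorem in Python, using a hypothetical three-event partition (the numbers are invented for illustration).

# Hypothetical partition B1, B2, B3 with prior probabilities Pr(Bj),
# and conditional probabilities Pr(A | Bj).
prior = [0.5, 0.3, 0.2]
like = [0.10, 0.40, 0.70]

# Total probability rule: Pr(A) = sum_j Pr(A | Bj) Pr(Bj).
pr_A = sum(l * p for l, p in zip(like, prior))

# Bayes theorem (2): Pr(Bi | A) = Pr(A | Bi) Pr(Bi) / Pr(A).
posterior = [l * p / pr_A for l, p in zip(like, prior)]

print(pr_A)        # 0.31
print(posterior)   # the posterior probabilities sum to 1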
3 Standard Distributions

3.1 Example distributions
Binomial distribution, B(n, π)
The binomial distribution can be defined as the number of successes in n independent Bernoulli trials
with two possible outcomes (success and failure) with probabilities π and 1 − π.
p(x) = (n choose x) π^x (1 − π)^{n−x},        x = 0, 1, ..., n        (0 < π < 1).

E[X] = nπ        Var(X) = nπ(1 − π)
Poisson distribution, P o(λ)
The Poisson distribution is often used as a model for the number of occurrences of rare events in time
or space, such as radioactive decays.
p(x) = e^{−λ} λ^x / x!,        x = 0, 1, ...        (λ > 0).

E[X] = λ        Var(X) = λ
Exponential distribution, exp(λ)
The exponential distribution is often used to describe the time between events which occur at random,
or to model “lifetimes”. It possesses the so-called “memoryless” property.
f(x) = λ e^{−λx},        x ≥ 0        (λ > 0).

E[X] = 1/λ        Var(X) = 1/λ²
Normal distribution, N (µ, σ 2 )
The normal (or Gaussian) distribution is the most widely used. It is convenient to use, often fits data
well and can be theoretically justified (via the central limit theorem).
f(x) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) ),        −∞ < x < ∞.

E[X] = µ        Var(X) = σ²
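These four distributions are available in scipy.stats (scipy is an assumption of this sketch, not a package used in the notes); the snippet below evaluates each and checks the stated means and variances.

from scipy import stats

# Binomial B(n=10, pi=0.3): pmf, mean n*pi and variance n*pi*(1-pi).
b = stats.binom(n=10, p=0.3)
print(b.pmf(2), b.mean(), b.var())        # mean 3.0, var 2.1

# Poisson Po(lambda=4): mean and variance both equal lambda.
po = stats.poisson(mu=4)
print(po.pmf(2), po.mean(), po.var())     # mean 4.0, var 4.0

# Exponential exp(lambda=2): mean 1/lambda, variance 1/lambda^2.
e = stats.expon(scale=1/2)                # scipy uses scale = 1/lambda
print(e.pdf(0.5), e.mean(), e.var())      # mean 0.5, var 0.25

# Normal N(mu=1, sigma^2=4).
nrm = stats.norm(loc=1, scale=2)          # scale is the standard deviation
print(nrm.pdf(1), nrm.mean(), nrm.var())  # mean 1.0, var 4.0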
FURTHER READING: Sections 4.2, 4.3 and 7.1 of Stirzaker.
3.2 Moment generating functions
The moment generating function (mgf) of a random variable X is defined as M_X(t) = E[e^{tX}] = Σ_x e^{tx} p_X(x) if discrete, or ∫_x e^{tx} f_X(x) dx if continuous, and it exists provided the sum or integral converges in an interval containing t = 0.
1. The mgf is unique to a probability distribution.
2. By considering the (Taylor) power series expansion

M_X(t) = Σ_{r=0}^{∞} (t^r / r!) E[X^r],

we see that E[X^r] is the coefficient of t^r/r!.
3. Moments can easily be found by differentiation:

E[X^r] = [ d^r/dt^r M_X(t) ]_{t=0},

i.e. E[X^r] is the rth derivative of M_X(t) evaluated at t = 0.
4. If X has mgf MX (t) and Y = aX + b, where a and b are constants, then the mgf of Y is
M_Y(t) = e^{bt} M_X(at).
5. If X and Y are independent random variables with mgfs M_X(t) and M_Y(t) respectively, then Z = X + Y has mgf given by

M_Z(t) = M_X(t) M_Y(t).

Extending this to n independent random variables, X_i, i = 1, 2, ..., n, with mgfs M_{X_i}(t), i = 1, 2, ..., n, the mgf of Z = Σ X_i is

M_Z(t) = M_{X_1}(t) M_{X_2}(t) ... M_{X_n}(t).

If X_i, i = 1, 2, ..., n, are independent and identically distributed (i.i.d.) with common mgf M_X(t) then

M_Z(t) = {M_X(t)}^n.
6. If {X_n} is a sequence of random variables with mgfs {M_{X_n}(t)}, and X is a random variable with mgf M_X(t) such that

lim_{n→∞} M_{X_n}(t) = M_X(t),

then the limiting distribution of X_n is the distribution of X.
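A symbolic sketch of properties 3 and 5 using sympy (assumed available, not part of the notes), taking the exponential distribution, whose mgf is M_X(t) = λ/(λ − t) for t < λ, as the example.

import sympy as sp

t, lam = sp.symbols('t lam', positive=True)

# MGF of the exponential distribution with rate lam (valid for t < lam).
M = lam / (lam - t)

EX = sp.diff(M, t).subs(t, 0)                     # E[X]   = 1/lam
EX2 = sp.diff(M, t, 2).subs(t, 0)                 # E[X^2] = 2/lam^2
var = sp.simplify(EX2 - EX**2)                    # Var(X) = 1/lam^2

# Property 5: the sum of n i.i.d. copies has mgf {M_X(t)}^n.
n = sp.symbols('n', positive=True, integer=True)
M_sum = M**n

print(EX, EX2, var, M_sum)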
3.3 Sampling and sampling distributions
The first task of any research project is the design of the investigation. It is important to gather all
information regarding the problem from historical records and from experts. This allows each part of
the experimental design, modelling and even analysis to be planned.
The target population is the set of all people, products or things about which we would like to draw
conclusions. Typically we will be interested in some particular characteristic of the population, such
as weight or risk associated with a particular financial product. The sample is a, usually small, sub-set
of the population and is selected in such a way as to be representative of the population. We will
then use the sample to draw conclusions about the population. The choice of sample size depends on
many factors such as the sampling method, the natural variability, measurement error and the required
precision of any estimation or the power of any hypothesis tests.
Suppose we have a random sample of n observations or measurements, x1 , . . . , xn , of a random
variable X. It is very common to summarise the sample using a small number of sample statistics,
rather than report the whole sample. The most usual summary statistics are the sample mean and the sample variance,

x̄ = (1/n) Σ x_i        and        s² = (1/(n−1)) Σ (x_i − x̄)².

Other sample summaries are possible, such as the median and mode as measures of centre or location of the distribution, or the range and inter-quartile range as measures of the spread of the distribution.
As well as numerical statistics, it is common to consider graphical representations. Stem-and-leaf and
box plots can be used to display the numerical summaries and are particularly useful for comparing
general properties between samples. Also, histograms can help to choose, or confirm, a probability
model. Numerical statistics are then used to estimate model parameters, for example using the sample proportion to estimate the probability in the binomial, or the sample mean and variance to estimate the population mean and variance in the normal.
If we were to repeat the sampling process to obtain other datasets, then we would not expect the
various summary statistics to be unchanged – this is due to sampling variation. We can imagine
performing the sampling many times and looking at the distribution of the summary statistic – this is
the sampling distribution.
Suppose we have a random sample, X_1, ..., X_n, from a normal population with mean µ and variance σ². It can be shown that the sampling distribution of the sample mean also has a normal distribution with mean µ but with variance σ²/n, that is X̄ ∼ N(µ, σ²/n). We can also derive results about other distributions. For example, a good estimator of the probability, π, in the Binomial distribution is X̄/n, and for the Poisson λ̂ = X̄ is a good choice. Notice that each of these is a function of the mean and, although the data are not from a normal distribution, we can call on the central limit theorem, if we have a large sample, to justify a normal approximation. That is, in the Binomial π̂ is approximately N(π, π(1 − π)/n), and for the Poisson λ̂ approximately follows N(λ, λ/n).
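A simulation sketch (Python with numpy, assumed available) of the sampling distribution of the mean for a normal population, checking that its variance is close to σ²/n.

import numpy as np

# Draw many samples of size n from N(mu, sigma^2) and look at the spread
# of the resulting sample means.
rng = np.random.default_rng(1)
mu, sigma, n, reps = 5.0, 2.0, 25, 10_000

means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(means.mean())   # close to mu = 5
print(means.var())    # close to sigma^2 / n = 0.16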
Exercises
(1.1) Let X be a random variable with probability mass function p_X(x) given by

x         −3     −1     0      1      2      3      5      8
p_X(x)    0.1    0.2    0.15   0.2    0.1    0.15   0.05   0.05
Check that pX (x) defines a valid probability distribution, then calculate P (1 ≤ X ≤ 4) and
P (X is negative). Evaluate the expected value E[X] and the variance V ar(X).
(1.2) Suppose that X has a PDF
fX (x) = cx(2 − x) for
0 ≤ x ≤ 2.
Find the constant c so that this is a valid PDF. Obtain the cumulative distribution function
FX (x), and then find the probability that X > 1.
(2.1) Let X and Y have the joint probability mass function given in the following table.
              Value of y
x         −2      −1      0       1
1         1/32    3/32    3/32    1/32
2         1/16    3/16    3/16    1/16
3         1/32    3/32    3/32    1/32
Find the cumulative distribution function of Y , and the conditional distribution of Y given X.
Are X and Y independent?
(2.2) The joint PDF of X and Y is given by
f(x, y) = (6/7)(x² + xy/2),        0 ≤ x ≤ 1, 0 ≤ y ≤ 2.
Find the marginal PDFs of X and Y , and then the cumulative distribution function of X. Evaluate the expectation of X, the expectation of X(X − Y ), and the conditional expectation of X
given that Y = 1.
(2.3) A laboratory blood test is 80% effective in detecting a certain disease when it is in fact present.
However, the test also yields a ‘false positive’ result for 5% of healthy persons tested. Suppose
that 0.4% of the population actually have the disease. What is the probability that a person
found ‘ill’ according to the test does have the disease?
(3.1) An exam paper consists of 20 multiple choice questions with 5 possible answers each (only one
is correct). In order to get a pass mark, it is necessary to give correct answers to at least 20% of
questions.
(a) A student has decided to answer just by guessing. What is the probability that he would
pass the exam?
(b) Suppose now that the student pursues an “educated guess”, in that he knows enough to
be able to discard two most unlikely answers for each of 20 questions, and will guess at
random on the remaining two answers. What are his chances to pass the exam now?
(3.2) Suppose that X has an exponential distribution with p.d.f. given by f (x) = λe−λx for x ≥ 0
and f (x) = 0 otherwise, and suppose that λ = 2.
(a) Evaluate the probability Pr(X > 1/2).
(b) Find the value of x such that F_X(x) = 1/2.
(c) Evaluate the probability Pr(X > 1 | X > 1/2).
(3.3) Suppose that X has a Poisson distribution with parameter λ, then find the MGF and hence the
mean and variance of X. Using MGFs, show that the sum of two independent Poisson random
variables is also a Poisson random variable.
4 Linear regression and least squares estimation

4.1 Introduction
In many sampling situations it is essential to consider related variables simultaneously. Even in situations where there is an exact physical law, measured data will be subject to random fluctuations
and hence fitting a functional relationship to data is a common task. There may be some information
before an experiment about the type of relationship expected, and this may have been used in the
experimental design, but it is always wise to visualize the possible relationship using a scatterplot.
The most commonly used model is the straight line. This may be due to a physical law which is
linear, or as a local approximation to a nonlinear relationship. In other cases there will be no theoretical justification but it is simply chosen because the data seem to follow a linear pattern. In all
cases, it is important to check that this assumption is reasonable both before the analysis, by drawing
a scatterplot, and afterwards by performing a residual analysis – these are not covered in this course.
4.2 The linear regression model
Suppose we have a dataset containing n paired values, {(xi , yi ) : i = 1, . . . , n}. Consider the simple
linear regression model
y_i = α + β x_i + ε_i,        i = 1, ..., n,

where
y_i is the response or dependent variable,
α and β are regression parameters,
x_i is the independent or explanatory variable, measured without error, and
ε_i is the random error term and is independent and identically distributed (iid) N(0, σ²).

Consider a general straight line passing through the data plotted as points on a scatterplot. In general, the points will not lie perfectly on the line, but instead there is an error, or residual, associated with each point. Let the straight line be defined by the equation y = α + βx, and the y-value on the line when x = x_i is denoted ŷ_i = α + β x_i. Now, since we are assuming that the explanatory variable is measured without error, the residuals are measured only in the y direction. So

r_i = y_i − ŷ_i = y_i − (α + β x_i),        i = 1, ..., n.

Given observations {(x_i, y_i); i = 1, ..., n}, we estimate the regression parameters by least squares. To do this, we minimise the sum of squared residuals S = Σ_i r_i².
Using partial differentiation of S with respect to α and β separately we obtain

∂S/∂α = −2 Σ_i (y_i − α − β x_i)
∂S/∂β = −2 Σ_i x_i (y_i − α − β x_i).

The minimum can be found by solving these equations, giving parameter estimates α̂ and β̂ when ∂S/∂α = 0 and ∂S/∂β = 0, that is when

Σ_i y_i = n α̂ + β̂ Σ_i x_i
Σ_i x_i y_i = α̂ Σ_i x_i + β̂ Σ_i x_i²

— these are called the normal equations. Dividing the first by n gives ȳ = α̂ + β̂ x̄; then substituting for α̂ in the fitted equation, ŷ = α̂ + β̂ x, gives ŷ − ȳ = β̂(x − x̄). Notice that when x = x̄ then ŷ = ȳ, that is, the line passes through the centroid, (x̄, ȳ), of the data.

Now, dividing the first normal equation by n and re-arranging gives

α̂ = (1/n) Σ_i y_i − β̂ (1/n) Σ_i x_i

which can be substituted into the second normal equation, and after a few steps leads to

Σ_i x_i y_i = (Σ_i x_i)(Σ_i y_i)/n + β̂ [ Σ_i x_i² − (Σ_i x_i)(Σ_i x_i)/n ].

This gives the result, in two alternative forms,

β̂ = [ Σ_i x_i y_i − (Σ_i x_i)(Σ_i y_i)/n ] / [ Σ_i x_i² − (Σ_i x_i)²/n ] = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)².

Summarising this, the least-squares estimators are

α̂ = ȳ − β̂ x̄        and        β̂ = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)².

We can also write β̂ = S_xy / S_xx, where we define

S_xx = Σ_i (x_i − x̄)²        and        S_xy = Σ_i (x_i − x̄)(y_i − ȳ).

To make predictions of the response, ŷ, corresponding to values of the explanatory variable, x, we simply substitute into the fitted equation, ŷ = α̂ + β̂ x. Similarly, fitted values, ŷ_i, can be calculated corresponding to observed values of the explanatory variable, x_i, by substitution as ŷ_i = α̂ + β̂ x_i.
4.3 Vector form of linear regression
Consider the centred linear regression model where x̄ has been subtracted from all x-values,

y_i = α₀ + β(x_i − x̄) + ε_i,        i = 1, ..., n.

We keep the same notation for the slope parameter but relabel the intercept. Comparing the two versions we see that α₀ = α + β x̄, so α̂₀ = α̂ + β̂ x̄ = (ȳ − β̂ x̄) + β̂ x̄ = ȳ.
We can write our centred regression model in vector form as y = Xθ + ε, where

y = (y_1, y_2, ..., y_n)ᵀ,        θ = (α₀, β)ᵀ,        ε = (ε_1, ε_2, ..., ε_n)ᵀ,

and X is the n × 2 design matrix whose ith row is (1, x_i − x̄).
Note that we could write the uncentred model in a similar form — we would just have to change
the second column of X. Note also that we could include multiple explanatory variables simply by
adding more columns to X and more parameters to θ.
We can estimate θ by least squares. Note that r = y − Xθ, and minimise

S = Σ_i r_i² = rᵀr
  = (y − Xθ)ᵀ(y − Xθ)
  = yᵀy − (Xθ)ᵀy − yᵀXθ + (Xθ)ᵀ(Xθ)
  = yᵀy − 2θᵀXᵀy + θᵀXᵀXθ.

Differentiating S with respect to θ gives

∂S/∂θ = −2Xᵀy + 2XᵀXθ,

and setting this to zero gives the set of equations XᵀX θ̂ = Xᵀy, which defines the least squares estimators

(α̂₀, β̂)ᵀ = θ̂ = (XᵀX)⁻¹ Xᵀy.

You should check for yourself that this gives the same parameter estimates as before.
Example: Consider the following data on pullover sales (number of pullovers sold) and price (in EUR) per item. The aim is to discover if the price (explanatory variable) influences the overall sales (response variable).

Sales   230   181   165   150    97   192   181   189   172   170
Price   125    99    97   115   120   100    80    90    95   125
To investigate the relationship between sales (y) and price (x) we can calculate the following values.

Price, x_i    x_i²     Sales, y_i    x_i y_i
125           15625    230           28750
99            9801     181           17919
97            9409     165           16005
115           13225    150           17250
120           14400    97            11640
100           10000    192           19200
80            6400     181           14480
90            8100     189           17010
95            9025     172           16340
125           15625    170           21250

Totals: Σx_i = 1046, Σy_i = 1727, Σx_i y_i = 179844; also S_xx = 2198.4.

Fig: Scatterplot of pullover sales against price, with fitted equation.

First, the means are x̄ = 104.6 and ȳ = 172.7, and then S_xx = 2198.4 and S_xy = −800.2, giving β̂ = S_xy/S_xx = −800.2/2198.4 = −0.364 (to 3 d.p.) and α̂ = ȳ − β̂ x̄ = 172.7 + 0.364 × 104.6 = 210.774. Hence we have the fitted equation

Sales = 210.774 − 0.364 × Price.
Alternatively, we can fit our regression model by constructing the matrix representation, defining

y = (230, 181, 165, 150, 97, 192, 181, 189, 172, 170)ᵀ

and X as the 10 × 2 matrix whose first column is all 1s and whose second column holds the centred prices,

(20.4, −5.6, −7.6, 10.4, 15.4, −4.6, −24.6, −14.6, −9.6, 20.4)ᵀ.

Then we can find α̂₀ and β̂ using

(α̂₀, β̂)ᵀ = (XᵀX)⁻¹ Xᵀy = [ 10.0  0.0 ; 0.0  2198.4 ]⁻¹ (1727.0, −800.2)ᵀ = (172.7, −0.364)ᵀ.

Hence α̂ = α̂₀ − β̂ x̄ = 172.7 + 0.364 × 104.6 = 210.774, as before.
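The same fit can be reproduced numerically; the sketch below (Python with numpy, assumed available, not part of the notes) uses the pullover data both through the S_xy/S_xx formulas and through the centred matrix form.

import numpy as np

# Pullover data from the example: price (x) and sales (y).
x = np.array([125, 99, 97, 115, 120, 100, 80, 90, 95, 125], dtype=float)
y = np.array([230, 181, 165, 150, 97, 192, 181, 189, 172, 170], dtype=float)

# Summation form: beta = Sxy/Sxx, alpha = ybar - beta*xbar.
Sxx = np.sum((x - x.mean())**2)                  # 2198.4
Sxy = np.sum((x - x.mean()) * (y - y.mean()))    # -800.2
beta = Sxy / Sxx                                 # approx -0.364
alpha = y.mean() - beta * x.mean()               # approx 210.774

# Matrix form with the centred design matrix: theta = (X^T X)^{-1} X^T y.
X = np.column_stack([np.ones_like(x), x - x.mean()])
alpha0, beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)

print(alpha, beta)             # 210.77..., -0.3639...
print(alpha0, beta_matrix)     # 172.7 (= ybar) and the same slope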
5 Classical Estimation and Hypothesis Testing

5.1 Introduction
Statistical inference is the process where we attempt to say something about an unknown probability
model based on a set of data which were generated by the model. This inference does not have the
status of absolute truth, since there will be (infinitely) many probability models which are consistent
with a given set of data. All we can do is to establish that some of these models are plausible, while
others are implausible.
A common approach is to use a probability model for the data which is completely specified except
for the numerical values of a finite number of quantities called parameters. In this chapter we will
introduce methods for making inferences about parameters assuming that the given model is correct.
The idea is to use the data, xᵀ = (x_1, x_2, ..., x_n), to make a "good guess" at the numerical value of a parameter, θ.

An ESTIMATE, θ̂ = θ̂(x), is a numeric value which is a function of the data. An ESTIMATOR is a random variable, θ̂(X), which is a function of a random sample Xᵀ = (X_1, X_2, ..., X_n).
5.2 Method of Moments
Assume that the X_i are mutually independent with common p.d.f. f(x; θ_1, θ_2, ..., θ_p). Then the rth population moment (about zero) is

µ′_r(θ_1, θ_2, ..., θ_p) = E[X^r]

and the rth sample moment is

m′_r = (1/n) Σ_{i=1}^{n} x_i^r.

The method of moments estimate of θ_1, θ_2, ..., θ_p is the solution of the p simultaneous (non-linear) equations

µ′_r(θ_1, θ_2, ..., θ_p) = m′_r,        r = 1, 2, ..., p.
This method of estimation has no general optimality properties and sometimes does very badly, but
usually provides sensible initial guesses for numerical search procedures.
Example
Let X be an exponential random variable with unknown parameter λ. Now let x_i : i = 1, ..., n be a set of independent observations of this variable. The first sample moment is the sample mean x̄, and the first population moment is the expectation of X, i.e. 1/λ. Hence, we find the method of moments estimate of λ by solving 1/λ̂ = x̄, that is λ̂ = 1/x̄.
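A small simulation sketch (Python with numpy, assumed available) of the method of moments estimate for the exponential, λ̂ = 1/x̄.

import numpy as np

# Method of moments for the exponential: equate the first population moment
# 1/lambda to the sample mean, giving lambda_hat = 1/xbar.
rng = np.random.default_rng(2)
true_lam = 2.5
x = rng.exponential(scale=1/true_lam, size=1000)   # numpy uses scale = 1/lambda

lam_mom = 1 / x.mean()
print(lam_mom)    # close to 2.5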
FURTHER READING: Sections 8.1 to 8.5 of Rice.
5.3 Maximum likelihood estimation
The joint pdf of a set of data can be written as f (x; θ). Think of this as a function of θ for a particular
data set, and define the likelihood function
L(θ) = f (x; θ).
An obvious guess at θ is the value which maximises the likelihood, that is the most plausible value
given the data. This value is called the maximum likelihood estimate (mle).
For technical reasons it is usual to work with the log-likelihood, l(θ) = log L(θ) = log f (x; θ).
Further note that if the X_i are mutually independent with common pdf f(·) then

l(θ) = Σ_{i=1}^{n} log f(x_i; θ).
Maximum likelihood estimation enjoys strong optimality properties (at least in large samples).
Example
Again let X be an exponential random variable with unknown parameter λ. Now let xi : i = 1, .., n
be a set of independent observations of this variable.
The log-likelihood is given by

l(λ) = Σ_{i=1}^{n} log f(x_i; λ) = n log(λ) − λ Σ_{i=1}^{n} x_i.

To find the maximum, differentiate with respect to λ, set equal to zero and solve. This produces the estimate λ̂ = 1/x̄. In this case, the m.l.e. is the same as the method of moments estimate.
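The following sketch (Python with numpy and scipy, assumed available) maximises the exponential log-likelihood numerically and confirms that it matches the closed-form λ̂ = 1/x̄.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.exponential(scale=1/2.0, size=500)          # data with true lambda = 2

# Negative log-likelihood for the exponential: -(n log(lam) - lam * sum(x)).
def neg_loglik(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 100), method='bounded')
print(res.x, 1 / x.mean())    # numerical mle and closed form 1/xbar agree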
5.4 Properties
1. The most important property is UNBIASEDNESS. An estimator, θ̂, is unbiased for θ if

E[θ̂] = θ.

If, for small n, E[θ̂] ≠ θ, but E[θ̂] → θ as n → ∞, then the estimator is ASYMPTOTICALLY UNBIASED.

2. An (unbiased) estimator is CONSISTENT if

Var(θ̂) → 0 as n → ∞.

3. If we have two (or more) estimators, θ̂ and θ̃, which are unbiased, then we might choose the one with smallest variance. The EFFICIENCY of θ̂ relative to θ̃ is defined to be

eff(θ̂, θ̃) = Var(θ̃) / Var(θ̂).
5.5 Hypothesis Testing
Let X1 , . . . , Xn be a random sample from a distribution. The general approach to statistical testing is
to consider whether the data are consistent with some stated theory or hypothesis.
A hypothesis is a statement about the true probability model, though usually this only concerns the
parameter within some specified family, for example N (µ, 1).
A simple hypothesis specifies a single point value for the parameter, for example µ = µ0, whereas a composite hypothesis specifies a range, or set, of values, for example µ < µ0 or µ ≠ µ0.
We usually assume that there are two rival hypotheses:
• Null hypothesis, H0 : µ = µ0 (usually well-defined and simple),
• Alternative hypothesis, H1 : “some statement” which is a competitor to H0 .
Note: We do not claim to show that H0 or H1 is true, but only to assess if the data provide sufficient
evidence to doubt H0 . The null hypothesis usually represents a “bench mark” or a “skeptical stance”,
for example “this treatment has no effect on the response”, and we will only reject it if there is
overwhelming evidence against it.
We make a decision whether to accept H0 or to accept H1 (that is, reject H0) on the basis of the data. Of course any conclusion is bound to be chancy! There must be a (non-zero) probability of a wrong action, and this is a major characteristic of a statistical test procedure.
The types of error can be summarised as follows:
Decision       H0 True     H0 False
Accept H0      Correct     Wrong
Reject H0      Wrong       Correct
Then we define α, the significance level,

α = Pr(Type I Error) = Pr(Reject H0 when H0 is true).

This is considered the more important of the two types of error, and of course we want α to be small. Next consider the other error, and define its probability as β,

β = Pr(Type II Error) = Pr(Accept H0 when H0 is false),

which should be small. Also we define the power function

φ = 1 − Pr(Type II error) = 1 − β

and clearly we want this to be large – a powerful test. Note that this will be a function of the, unknown, true parameter.
5.6 Examples of simple hypothesis tests
Throughout the following examples suppose we have a sample of n observations x1 , x2 , . . . , xn from
a normally distributed population with mean µ and variance σ 2 .
Example 1: Suppose that we know the population variance, but we do not know the population mean. From the sample we can estimate the population mean using the sample mean, µ̂ = x̄. We might now wish to test the hypothesis that the population mean is some specified value µ0 compared to the hypothesis that it is not equal to the specified value. That is, null hypothesis H0: µ = µ0 against alternative hypothesis H1: µ ≠ µ0.

A suitable test statistic is the (observed) z-value

z_obs = (x̄ − µ0) / (σ/√n)

which is then compared to the standard normal distribution.

Example 2: Suppose now that the population variance is unknown. For the same hypotheses as above, H0: µ = µ0 against H1: µ ≠ µ0, the corresponding test statistic is the (observed) t-value

t_obs = (x̄ − µ0) / (s_{n−1}/√n)

where s_{n−1} is the sample standard deviation defined by s²_{n−1} = Σ (x_i − x̄)²/(n − 1). This is compared to the (so-called) t-distribution with n − 1 degrees of freedom.
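A sketch of the one-sample t-test in Python (scipy assumed; the data are invented for illustration), computing t_obs directly and via the built-in routine.

import numpy as np
from scipy import stats

# One-sample t-test of H0: mu = mu0 against H1: mu != mu0, sigma unknown.
x = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9])
mu0 = 5.0

t_obs = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
p_value = 2 * stats.t.sf(abs(t_obs), df=len(x) - 1)

print(t_obs, p_value)
print(stats.ttest_1samp(x, mu0))    # same test using the built-in routine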
5.7 The likelihood ratio test
Consider a random sample X_1, ..., X_n from some distribution with parameter θ, and suppose that we wish to test H0: θ = θ0 against H1: θ ≠ θ0. The likelihood ratio statistic is defined as:

Λ = L(θ0) / L(θ̂)

where θ̂ is the maximum likelihood estimate of θ. Note that 0 ≤ Λ ≤ 1. If there are other unknown parameters, then these are replaced using the appropriate maximum likelihood estimates. We then reject the null hypothesis if Λ is less than some specified value, Λ0 say. This is intuitive since values close to zero suggest H1 is true, whereas values close to 1 suggest H0 is true. It is usual to work with the log likelihood-ratio, λ = log Λ, and we reject H0 when λ is far below zero.

Equivalently, subject to some conditions, and when n is large, Wilks' theorem states that

W = −2 log Λ is approximately χ²₁ under H0.

We now reject H0 if W is large, and in particular if it is larger than χ²₁(1 − α) for a test at the 100α% significance level.
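A sketch of the likelihood ratio test for the exponential rate (Python with numpy and scipy assumed; the data are simulated purely for illustration), using Wilks' chi-squared approximation.

import numpy as np
from scipy import stats

# Likelihood ratio test of H0: lambda = lambda0 for exponential data.
rng = np.random.default_rng(4)
x = rng.exponential(scale=1/2.0, size=200)   # data generated with lambda = 2
lam0 = 1.5                                   # hypothesised value

def loglik(lam):
    return len(x) * np.log(lam) - lam * x.sum()

lam_hat = 1 / x.mean()                       # maximum likelihood estimate
W = -2 * (loglik(lam0) - loglik(lam_hat))    # Wilks statistic, approx chi^2_1

crit = stats.chi2.ppf(0.95, df=1)            # 3.841...
print(W, crit, W > crit)                     # True means reject H0 at the 5% level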
Exercises
(4.1) A study was made on the amount of converted sugar in a fermentation process at various temperatures. The data were coded (by subtracting 21.5 degrees centigrade from the temperatures)
and recorded as follows:
Sugar remaining after fermentation
Temp., x   −0.5   −0.4   −0.3   −0.2   −0.1    0     0.1    0.2    0.3    0.4    0.5
Sugar, y    8.1    7.8    8.5    9.8    9.5    8.9    8.6   10.2    9.3    9.2   10.5
Why do you think that the data were coded by subtracting 21.5 from the temperatures?
Fit a linear regression model and find the regression equation for your model.
(4.2) Consider the multiple linear regression of a response y on two explanatory variables x and w using the model

y_i = α + β_1 x_i + β_2 w_i + ε_i,        i = 1, ..., n.

Assume that the predictors x and w are already centred, so Σ_i x_i = Σ_i w_i = 0. Use the method of least squares to find α̂ and show that β̂_1 is given by

β̂_1 = ( 1 − S²_wx/(S_ww S_xx) )⁻¹ ( S_xy/S_xx − S_wx S_wy/(S_ww S_xx) ),

where S_xy and S_xx are as defined in the notes and

S_ww = Σ_i (w_i − w̄)²,        S_wx = Σ_i (w_i − w̄)(x_i − x̄),        and        S_wy = Σ_i (w_i − w̄)(y_i − ȳ).
(5.1) In a survey of 320 families with 5 children the number of girls occurred with the following
frequencies:
Number of girls    0    1    2     3    4    5
Frequency          8   40   88   110   56   18
Explain why the binomial distribution might be a suitable model for this data and clearly state
any assumptions. Derive the equation for the maximum likelihood estimator of p, the probability of a girl, and then estimate the value using the data.
(5.2) To study a particular currency exchange rate it is possible to model the daily change in log
exchange rate by a normal distribution. Suppose the following is a random sample of 10 such
values:
0.05 0.29 0.39 -0.18 0.11 0.15 0.35 0.28 -0.17 0.07
Use these data to perform a 5% hypothesis test that the mean change is equal to zero.
6 The Normal Distribution

6.1 Introduction
Many statistical methods for constructing confidence intervals and hypothesis tests are based on an
assumption of normality. The assumption of normality often leads to procedures which are simple,
mathematically tractable, and powerful compared to corresponding approaches which do not make
the normality assumption. When dealing with large samples results such as the central limit theorem
give us confidence that small departures are unlikely to be important. With small samples or when
there are substantial violations of a normality assumption, the chances of misinterpreting the data and drawing incorrect conclusions seriously increase. Because of this we must carefully consider all
assumptions throughout data analysis.
Once data have been collected it is important to check modelling assumptions. There are several
ways to tell whether a dataset is substantially non-normal such as calculation and testing of skew
and kurtosis, and examination of histograms and probability plots. Histograms “approximate” the
true probability distribution but can be greatly affected by the choice of histogram bins etc. Another approach is to consider the probability plot (or quantile-quantile plot) where the expected or theoretical
quantiles are plotted against the sample quantiles. If the model is a good fit to the data, then the points
should form a straight line – departures from the line indicate departures from the model.
6.2 Definitions
Suppose that random variable X follows a normal distribution with mean µ and variance σ²; then it has probability density function

f(x) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) ),        −∞ < x < ∞,

and we might use the shorthand notation X ∼ N(µ, σ²). The cumulative distribution function (CDF) cannot be evaluated as an explicit equation but must be evaluated numerically.
The normal density is unimodal (with mode at its mean) and symmetric about its mean. Hence its
mean, median and mode are all equal. We can say that E[X] = µ and Var(X) = σ², but also that the coefficient of skewness, E[(X − µ)³/σ³] = 0, confirming that it is symmetric, and that the coefficient of kurtosis E[(X − µ)⁴/σ⁴] = 3 (so the excess kurtosis is zero).
If normal random variable Z has mean equal to zero and variance equal to one, then we have the
standard normal distribution, Z ∼ N (0, 1). The PDF is sometimes given the notation f (z) = φ(z)
and the CDF F(z) = Φ(z). Note that if X ∼ N(µ, σ²) then (X − µ)/σ ∼ N(0, 1).
6.3 Transformations to normality
Many data sets are in fact not approximately normal. However, an appropriate transformation of a
data set can often yield a data set that does follow approximately a normal distribution. This increases
the applicability and usefulness of statistical techniques based on the normality assumption. The
Box-Cox transformation is a particularly useful family of transformations defined as:

T(x; λ) = (x^λ − 1)/λ    for λ ≠ 0,
T(x; λ) = log(x)          for λ = 0,

where λ is a transformation parameter. There are several important special cases: (i) SQUARE ROOT TRANSFORMATION with λ = 1/2. If necessary, make all values positive by adding a constant before taking the square root. (ii) LOG TRANSFORMATION with λ = 0. Again, it may be necessary to add a constant to make all values positive before taking logs. (iii) INVERSE TRANSFORMATION with λ = −1. Notice that simply inverting the values would make small numbers large, and large numbers small – this transformation would reverse the order of the values and great care would be needed in the interpretation. This is not a problem with the Box-Cox transform as the ordering of the values will be identical to the original data. Data transformations are valuable tools, offering many benefits, but great care must be used when interpreting results based on transformed data.
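A sketch of the Box-Cox transformation applied to a skewed sample, using scipy.stats.boxcox (scipy and the lognormal example are assumptions of this illustration, not part of the notes).

import numpy as np
from scipy import stats

# Box-Cox transformation of a positively skewed (lognormal) sample.
rng = np.random.default_rng(5)
x = rng.lognormal(mean=0.0, sigma=0.8, size=500)   # strictly positive data

y, lam_hat = stats.boxcox(x)    # transformed data and the estimated lambda

print(lam_hat)                         # should be near 0 (a log transform)
print(stats.skew(x), stats.skew(y))    # skewness is much reduced after transforming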
6.4 Approximating distributions
Under certain conditions some probability distributions can be approximated by other distributions.
Historically, this was important as it gave an easy way to perform probability calculations, but also
it helps us to understand relationships between distributions and later to understand transformations
from one distribution to another.
The POISSON APPROXIMATION TO THE BINOMIAL works well when n is large and p is small. As a rule of thumb we might consider the approximation satisfactory when, say, n ≥ 20 and p ≤ 0.05 (alternatively when n is large and the expected number of "successes" is small, that is np ≤ 10 say). Another way to think of this is that the Poisson will work well when we are modelling rare events in a very large population. The NORMAL APPROXIMATION TO THE BINOMIAL is reasonable when n is large, and p and (1 − p) are NOT too small, say np and n(1 − p) must be greater than 5. Note that the conditions for the Poisson approximation to the Binomial are complementary to the conditions for the Normal approximation to the Binomial distribution.
Perhaps the most powerful result is the CENTRAL LIMIT THEOREM (CLT). Suppose we have a random
sample, X1 , X2 , . . . , Xn , from any distribution with finite mean, E[X], and variance, Var(X), then
the CLT says that, as the sample size n tends to infinity, the distribution of any sample mean, X̄, tends
to the normal with mean E[X] and variance equal to Var(X)/n.
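Two quick numerical illustrations (Python with numpy and scipy assumed, not part of the notes): the normal approximation to a binomial probability with a continuity correction, and the CLT behaviour of means of exponential samples.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Normal approximation to Binomial(n=50, p=0.4): np and n(1-p) both exceed 5.
n, p = 50, 0.4
exact = stats.binom.cdf(25, n, p)
approx = stats.norm.cdf(25.5, loc=n*p, scale=np.sqrt(n*p*(1-p)))  # continuity corr.
print(exact, approx)                 # the two values are close

# Central limit theorem: means of exponential samples look roughly normal.
means = rng.exponential(scale=1.0, size=(10_000, 40)).mean(axis=1)
print(means.mean(), means.var())     # approx 1 and 1/40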
7 Derived Distributions

7.1 Introduction
Initially it may seem that each probability distribution is unrelated to any other distribution, but in fact
many are related. As simple cases the binomial, geometric and negative binomial are all generated by
repeated Bernoulli trials. There are other examples where one random variable can be derived as a
transformation of another, or where one random variable is obtained as a sum of others. Perhaps the
most widely used transformations involve the normal distribution, such as linear functions, or the sum
of squared normal random variables. Less obviously, we can consider a normal random variable divided by a sum of squared normal random variables, or the ratio of two sums of squared normal random variables. Each of these corresponds to a common example and the answers should be familiar distributions. In the
next sections we will see the mathematical techniques needed to derive many of these results.
7.2 Functions of a random variable
For discrete random variables transformations are straightforward. Assuming that the range space
and probability mass function of the original random variable are known, then the range space for
the transformed random variable can easily be deduced then the probability can be transferred to the
elements of the new range space, using an argument of equivalent events.
The corresponding treatment of continuous random variables is not so straightforward. We are not
simply reallocating probability masses from elements in one range space to elements of another. In
this situation, we are dealing with the more subtle concept of density of probability. The simplest
approach is to calculate the (cumulative) distribution function of the transformed variable directly,
and then differentiate to obtain the density function.
Example: Consider an exponential random variable X with parameter λ, X ∼ exp(λ), and let Y = X²; then

F_Y(y) = Pr(Y ≤ y) = Pr(X² ≤ y) = Pr(X ≤ √y) = F_X(√y) = 1 − e^{−λ√y}.

Now differentiate to give the density function

f_Y(y) = d/dy F_Y(y) = d/dy ( 1 − e^{−λ√y} ) = (λ/(2√y)) e^{−λ√y},

and the range space of Y is S_Y = [0, ∞). Note that this density function is unbounded at the origin, unlike the original density function. Although, normally, y = x² is not regarded as a one-to-one function, over the range space S_X it is behaving as one-to-one.
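A simulation check of this derived distribution (Python with numpy assumed): the empirical distribution function of Y = X² is compared with the derived F_Y(y) = 1 − e^{−λ√y}.

import numpy as np

rng = np.random.default_rng(7)
lam = 1.5
x = rng.exponential(scale=1/lam, size=200_000)
y = x**2

for y0 in [0.1, 0.5, 1.0, 2.0]:
    empirical = np.mean(y <= y0)                 # simulated Pr(Y <= y0)
    derived = 1 - np.exp(-lam * np.sqrt(y0))     # derived distribution function
    print(y0, round(empirical, 4), round(derived, 4))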
FURTHER READING: Sections 2.3, 3.6 and 4.5 of Rice.
Result
Let X be a continuous random variable with p.d.f. f_X(·). Suppose that g(·) is a strictly monotonic function; then the random variable Y = g(X) has p.d.f. f_Y given by:

f_Y(y) = f_X(g⁻¹(y)) | d/dy g⁻¹(y) |        if y = g(x) for some x,
f_Y(y) = 0                                    if y ≠ g(x) for any x.

If y = g(x) is not monotonic over the range of X, we split the range into parts for which the function is monotonic (a one-to-one relation holds):

f_Y(y) = Σ f_X(g⁻¹(y)) | d/dy g⁻¹(y) |,        y = g(x),

where the sum is over the separate parts of the range of X for which x and y are in one-to-one correspondence.
Example: Consider a random variable X with parameter λ which has the following p.d.f.

f_X(x) = (λ/2) exp(−λ|x|),        −∞ < x < ∞.

This density function looks like two exponential functions placed back-to-back, and hence is often referred to as the double exponential, or less descriptively as the Laplace distribution.

Consider the transformation Y = X²; clearly with this range space for X, the transformation is not one-to-one. However, by dividing the range into two parts, −∞ < x < 0 and 0 < x < ∞, y = x² is monotonic over each half separately.

For (−∞, 0), X = −√Y, dx/dy = −(1/2) y^{−1/2} and f_X(x) = (λ/2) exp(λx), hence f_Y(y) = (λ/(4√y)) exp(−λ√y).

For (0, ∞), X = √Y, dx/dy = (1/2) y^{−1/2} and f_X(x) = (λ/2) exp(−λx), hence f_Y(y) = (λ/(4√y)) exp(−λ√y).

Summing the parts from the two ranges gives

f_Y(y) = (λ/(2√y)) exp(−λ√y),        y ≥ 0,

the same distribution as in the earlier example involving the exponential distribution and the transformation Y = X².
7.3 Transforming bivariate random variables
Suppose we wish to find the joint probability density function of a pair of random variables, Y1 and Y2 ,
which are given functions of two other random variables, X_1 and X_2. Suppose further that Y_1 = g_1(X_1, X_2) and Y_2 = g_2(X_1, X_2), and that the joint probability density function of X_1 and X_2 is f_{X_1,X_2}(x_1, x_2).
We assume the following conditions:
(I) The transformation (x1 , x2 ) 7→ (y1 , y2 ) is one-to-one. That is we can solve the simultaneous
equations y1 = g1 (x1 , x2 ) and y2 = g2 (x1 , x2 ), for x1 and x2 to give x1 = h1 (y1 , y2 ) and
x2 = h2 (y1 , y2 ) (say). Transformations which are not one-to-one can be handled, but are more
complicated except in special cases – such as for sums of independent random variables.
(II) The functions h_1 and h_2 have continuous partial derivatives and the Jacobian determinant is everywhere finite (that is |J| < ∞), where

J = det [ ∂x_1/∂y_1   ∂x_1/∂y_2 ; ∂x_2/∂y_1   ∂x_2/∂y_2 ].

Note that there are other ways to write this; all are equivalent.

Then,

f_{Y_1,Y_2}(y_1, y_2) = |J| f_{X_1,X_2}(x_1, x_2),

substituting for x_1 = h_1(y_1, y_2) and x_2 = h_2(y_1, y_2) where necessary. The range space for (y_1, y_2) is obtained by applying the inverse transformation to the constraints on x_1 and x_2.
Example: If X_1 and X_2 are independent exponential random variables each with parameter λ, then

f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2) = λ² e^{−λ(x_1 + x_2)},        x_1, x_2 ≥ 0.

Now, if Y_1 = X_1 + X_2 and Y_2 = e^{X_1}, then x_1 = h_1(y_1, y_2) = log(y_2) and x_2 = h_2(y_1, y_2) = y_1 − log(y_2). Now, the Jacobian matrix is

[ ∂x_1/∂y_1   ∂x_1/∂y_2 ; ∂x_2/∂y_1   ∂x_2/∂y_2 ] = [ 0   1/y_2 ; 1   −1/y_2 ]

and so the absolute value of its determinant is |J| = 1/y_2 (this is finite because it can also be shown that y_2 ≥ 1). Then

f_{Y_1,Y_2}(y_1, y_2) = (1/y_2) λ² e^{−λ y_1},        y_1 ≥ log y_2, y_2 ≥ 1.
7.4 Sums of independent random variables
Some results
1. If X1 , ..., Xn are independent Poisson random variables with parameters λ1 , ..., λn , then X1 +
... + Xn also has a Poisson distribution with parameter (λ1 + ... + λn ).
2. If X1 , ..., Xk are independent Binomial random variables with parameters (n1 , p), ..., (nk , p),
then X1 + ... + Xk also has a Binomial distribution with parameters (n1 + ... + nk , p).
3. If X1 , ..., Xn are independent gamma random variables with parameters (t1 , λ), ..., (tn , λ), then
X1 + ... + Xn also has a gamma distribution with parameters (t1 + ... + tn , λ).
4. If X1 , ..., Xn are independent normal random variables with parameters (µ1 , σ12 ), ..., (µn , σn2 ),
then X1 + ... + Xn also has a normal distribution with parameters (µ1 + ... + µn , σ12 + ... + σn2 ).
Direct method
If X and Y are independent random variables then the probability function for Z = X + Y is

p_Z(z) = Σ_x p_X(x) p_Y(z − x) = Σ_y p_X(z − y) p_Y(y),        if discrete;

f_Z(z) = ∫_x f_X(x) f_Y(z − x) dx = ∫_y f_X(z − y) f_Y(y) dy,        if continuous.
Using generating functions
The above results can be derived most easily using moment generating functions (or probability generating functions for the discrete cases), using the result that if Z = X_1 + ... + X_n and the X_i are independent then M_Z(t) = Π_i M_{X_i}(t). Of course we must be able to recognise the mgf of Z to identify the distribution.
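A numerical check of result 1 by the direct (convolution) method, using scipy.stats Poisson pmfs on a truncated support (an illustrative sketch, not from the notes).

import numpy as np
from scipy import stats

# Direct (convolution) check that the sum of independent Poisson(lam1) and
# Poisson(lam2) variables is Poisson(lam1 + lam2), on a truncated support.
lam1, lam2 = 2.0, 3.0
k = np.arange(0, 60)                      # support large enough for accuracy
p1 = stats.poisson.pmf(k, lam1)
p2 = stats.poisson.pmf(k, lam2)

p_sum = np.convolve(p1, p2)[:len(k)]      # p_Z(z) = sum_x p1(x) p2(z - x)
p_direct = stats.poisson.pmf(k, lam1 + lam2)

print(np.max(np.abs(p_sum - p_direct)))   # tiny (truncation error only)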
7.5 Distributions derived from the Normal distribution
The most frequently used techniques in statistics are the t-test and the F-test. These are used to
compare means of two or more samples and to make inferences about the population means from
which the samples were drawn. The test statistic in each case is not an arbitrary function of the data,
but is chosen to have useful properties. In particular the function is chosen so that its distribution is
known.
• If X has a standard normal distribution and independently Y has a chi-squared distribution with ν degrees of freedom, then X/√(Y/ν) has a t-distribution with ν degrees of freedom.
• If X_1 and X_2 have independent chi-squared distributions with ν_1 and ν_2 degrees of freedom, then (X_1/ν_1)/(X_2/ν_2) has an F-distribution with degrees of freedom ν_1 and ν_2.
Preliminary Results: Distribution of the mean and variance
Consider a random sample X_1, ..., X_n from a normal population with mean µ and variance σ², that is X_i ∼ N(µ, σ²), i = 1, ..., n. If we define X̄ = (1/n) Σ X_i and S² = (1/(n−1)) Σ (X_i − X̄)², then
(a) X̄ ∼ N(µ, σ²/n),
(b) (n − 1)S²/σ² ∼ χ²_{n−1}, and
(c) X̄ and S² are independent.
The t-distribution
Suppose we have a random sample X_1, ..., X_n from a normal population, X_i ∼ N(µ, σ²), i = 1, ..., n, with sample mean X̄ and variance S². If σ² is known, then (X̄ − µ)/(σ/√n) ∼ N(0, 1), whereas, if we estimate σ² by S², then (X̄ − µ)/(S/√n) ∼ t_{n−1}, that is, a t-distribution with n − 1 degrees of freedom.
The F-distribution
Suppose we have two independent random samples of size n_1 and n_2 from normal populations N(µ_1, σ²_1) and N(µ_2, σ²_2), with sample means and variances X̄_1, S²_1 and X̄_2, S²_2. Imagine we want to test H0: σ²_1 = σ²_2. If H0 is true then S²_1/S²_2 ≈ 1; if it is false, either S²_1/S²_2 is large (σ²_1 > σ²_2) or S²_1/S²_2 is close to zero (σ²_1 < σ²_2).

Now X_1 = (n_1 − 1)S²_1/σ²_1 ∼ χ²_{n_1−1} and X_2 = (n_2 − 1)S²_2/σ²_2 ∼ χ²_{n_2−1}, thus

F = [X_1/(n_1 − 1)] / [X_2/(n_2 − 1)] = (S²_1/σ²_1) / (S²_2/σ²_2) = S²_1/S²_2 under H0,

and so F ∼ F_{n_1−1, n_2−1}, an F-distribution with degrees of freedom (n_1 − 1) and (n_2 − 1).
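A simulation sketch (Python with numpy and scipy assumed) comparing quantiles of the simulated statistic (X̄ − µ)/(S/√n) with the t-distribution with n − 1 degrees of freedom.

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
mu, sigma, n, reps = 0.0, 1.0, 10, 50_000

samples = rng.normal(mu, sigma, size=(reps, n))
t_stat = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

for q in [0.9, 0.95, 0.975]:
    # Simulated quantiles should agree with the t-distribution quantiles.
    print(q, np.quantile(t_stat, q), stats.t.ppf(q, df=n - 1))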
FURTHER READING: Sections 3.6 and 4.5 of Rice.
Exercises
(6.1) Let Z be a standard normally distributed random variable, then find: (a) Pr(Z < 2), (b) Pr(−1 < Z < 1) and (c) Pr(Z² > 3.8416).
Hint: In (c) find the probability of an equivalent event involving Z and not Z².
(6.2) Suppose that X is a normally distributed random variable with mean µ and standard deviation
σ > 0, then it has MGF given by
M_X(t) = exp( µt + (1/2) σ² t² ).
Find the MGF of X ∗ = (X − µ)/σ and hence state the distribution of X ∗ .
(6.3) Let X be a normally distributed random variable with mean 10 and variance 25.
(a) Evaluate the probability Pr(X ≤ 8).
(b) Evaluate the probability Pr(15 ≤ X ≤ 20).
(6.4) Let X follow a binomial distribution with parameters n = 50 and p = 0.52.
What are the expectation and variance of X? Hence write down the normal distribution which
approximates this binomial distribution. Is this likely to be a good approximation?
Use the normal approximation, with a continuity correction, to evaluate the probability that X
is at least 30. Why is the continuity correction needed?
(7.1) If X is a continuous random variable with a uniform distribution on the interval [0, 1], that is
with PDF
fX (x) = 1 for 0 < x < 1,
then find the PDF of Y = − log(X)/λ where λ > 0. Name the distribution.
(7.2) Suppose X_1, ..., X_n are independent normal random variables, with corresponding parameters (µ_1, σ²_1), ..., (µ_n, σ²_n). Then, using MGFs, show that S_n = X_1 + ... + X_n also has a normal distribution. What are the parameters of this new distribution?
Suppose now that the random variables are also identically distributed, that is, with common mean µ and variance σ². What can be said about the distribution of X̄ = (1/n) S_n?
8 Bayesian Methods

8.1 Introduction
The Bayesian approach to statistics is currently very fashionable and respectable, but this has not
always been the case! Until, perhaps, 20 years ago Bayesian statisticians were seen as extremist and
fanatical. Leading statisticians of the day considered their work was unimportant and even “dangerous”. The main reason for this lack of trust is the subjective nature of some of the modelling. The key
difference, compared to classical statistics, is the use of subjective knowledge in addition to the usual
information from data.
Suppose we are interested in parameter θ. In the standard setting, we would perform an experiment
and use the data to estimate θ. But in practice we might have some knowledge about θ before doing
the experiment and want to incorporate this prior degree of belief about θ into the estimation process.
Let π(θ) be our prior density function for θ quantifying our prior degree of belief. From the data
we can calculate the likelihood, L(X|θ). These two sources of information can be combined to give,
π(θ|X), the posterior distribution of θ reflecting our belief about θ after the experiment.
Recall Bayes Theorem defined in terms of probabilities of events (A and B say),

P(A|B) = P(A ∩ B)/P(B) = P(B|A) P(A)/P(B).
The appropriate form of this for our situation is

π(θ|x) = L(x|θ) π(θ) / p(x).

Note however that the divisor is unimportant when making inference about θ and so we can simply say

π(θ|x) ∝ L(x|θ) π(θ),

that is,

"Posterior pdf is proportional to Likelihood times prior pdf".
The Bayesian method gives a way to include extra information into the problem and can make logical
interpretation easier.
Although the approach is straightforward, there can be serious (algebraic) difficulties deriving the
posterior distribution. Also, there are many possible choices for the prior distribution - with the
chance that the final conclusion might depend on this subjective choice of prior. One approach to
choice of prior is to use a non-informative prior (such as the uniform in the following example) which
does not have an influence on the modelling, or a vague prior where the influence is mild. To make
deriving the posterior distribution easier, and to give a standard approach to choice of prior it is
common to use a conjugate prior. That is, given the likelihood, the prior is chosen so that the prior
and posterior distributions are in the same family.
8.2 Conjugate prior distributions
To be able to progress much further we must first consider two new examples of continuous distributions,
the beta distribution and the gamma distribution. These are particularly important as they are
conjugate prior distributions for several widely used likelihood models. For example, the beta is the
conjugate prior for all the distributions based on the Bernoulli, that is the geometric and binomial. The
gamma is the conjugate prior distribution for the Poisson and the exponential. However, for the most
widely encountered data model, the normal distribution, it is the normal distribution itself which is
the conjugate prior.
Beta distribution, β(p, q)

f(x) = x^{p−1}(1 − x)^{q−1}/B(p, q),   0 ≤ x ≤ 1;   p, q > 0,

where B(p, q) = ∫_0^1 x^{p−1}(1 − x)^{q−1} dx.

Note that B(p, q) = Γ(p)Γ(q)/Γ(p + q), and that E[X] = p/(p + q) and Var[X] = pq/{(p + q)²(p + q + 1)}.

As a special case, when p = q = 1, this reduces to the continuous uniform distribution on the
interval (0, 1).
Gamma distribution, γ(α, λ)

f(x) = λ^α x^{α−1} e^{−λx}/Γ(α),   x ≥ 0;   α, λ > 0,

where Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx.

Note that Γ(α + 1) = αΓ(α) for all α, hence Γ(α + 1) = α! for integers α ≥ 1, and that Γ(1/2) = √π.
Also, E[X] = α/λ and Var[X] = α/λ².

As important special cases we have (a) when α = 1 this reduces to the exponential distribution
with parameter λ, and (b) when α = ν/2 and λ = 1/2 it becomes the chi-squared distribution with
ν degrees of freedom, χ²_ν.
Example: Coin tossing: Let θ be the probability of getting a head with a biased coin. In n tosses
of the coin we observe X = x heads, then

p(x|θ) = (n choose x) θ^x (1 − θ)^{n−x},   x = 0, 1, . . . , n.

Now suppose we only know that θ is on the probability scale, and so we have a uniform prior,

π(θ) = 1,   0 < θ < 1.

Now the posterior is proportional to likelihood times prior,

π(θ|x) ∝ p(x|θ)π(θ) = (n choose x) θ^x (1 − θ)^{n−x} · 1 ∝ θ^x (1 − θ)^{n−x}.

Notice that this is the form of a Beta distribution, that is it depends on the variable, θ, in the correct
way. Hence the posterior distribution is Beta and we can identify the parameters as p = x + 1 and
q = n − x + 1, that is θ|x ∼ β(x + 1, n − x + 1). We can now write down the pdf

π(θ|x) = θ^x (1 − θ)^{n−x} / B(x + 1, n − x + 1).
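As a quick illustration (a sketch, not part of the original example), the following R lines plot the flat prior and the resulting Beta posterior; the values of n and x are assumptions chosen only for the plot.

tosses <- 10; heads <- 7                   # hypothetical values chosen for illustration
curve(dbeta(x, heads + 1, tosses - heads + 1), from = 0, to = 1,
      xlab = "theta", ylab = "density")    # Beta(x+1, n-x+1) posterior
abline(h = 1, lty = 2)                     # the flat prior, pi(theta) = 1 on (0, 1)
legend("topleft", c("posterior", "prior"), lty = c(1, 2))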
8.3 Point and interval estimation
In classical statistics we have been interested in estimating a parameter θ. This can also be done in
Bayesian statistics. Recall that the posterior distribution contains all the information about θ, and
hence we base all our estimation on the posterior pdf.
Natural estimators of θ are: the Posterior Mean or Bayes Estimator, that is E[θ|X = x], and the
Posterior Mode or Maximum a Posteriori (MAP) Estimator. The MAP estimator is the most likely
value of θ and is the analogue of the maximum likelihood estimator.
To reflect the precision in this estimation we can construct a credibility interval (the equivalent of
the classical confidence interval). A 100(1 − α)% credibility interval for θ can be found using the
probability statement
Pr(θL ≤ θ ≤ θU) = 1 − α.
This can be interpreted as, the probability of θ being inside the interval is 1 − α (this is much more
intuitive than the interpretation of the classical confidence interval).
On its own this does not give a unique definition of the interval and so we can introduce the extra
condition that says
Pr(θ ≤ θL) = Pr(θ ≥ θU) = α/2;
this is called the equal-tailed interval.
Example: Coin tossing (Continued): Since the posterior pdf is a Beta distribution we already
know equations for the two point estimators: the mean is (x + 1)/(n + 2) and the mode is x/n (which
is the same as the MLE).
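An equal-tailed 95% credibility interval follows directly from the Beta posterior quantiles; here is a minimal R sketch, again using the assumed illustrative values of n and x from the earlier sketch.

tosses <- 10; heads <- 7; alpha <- 0.05    # illustrative values, not from the notes
qbeta(c(alpha / 2, 1 - alpha / 2), heads + 1, tosses - heads + 1)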
Exercises
(8.1) Suppose that we have a single observation x from a Poisson distribution with parameter θ.
Derive the posterior distribution, π(θ|x), when the prior distribution of θ is a Gamma(a, b)
distribution. Show that this is a conjugate prior.
Write down the posterior mean θ̄ = E[θ|X = x] and the maximum a posteriori (MAP) estimator
θ̂ = arg max π(θ|X = x).
For observation x = 3, and prior parameters a = 2 and b = 0.7, what is the corresponding
posterior distribution? Draw a graph of the prior and posterior distributions and comment. Find
the posterior mean and the MAP estimate.
(8.2) The number of defective items, X, in a random sample of n has a Binomial distribution where
the probability that a single item is defective is θ (0 < θ < 1). If the prior distribution of θ is the
Beta distribution with parameters α and β, obtain the posterior distribution of θ given X = x.
Determine the posterior mean E[θ|X = x].
In a particular case it is found that: n = 25, x = 8, and the prior belief about θ can be
summarised by a distribution with prior mean 1/2 and prior standard deviation 1/4.
Determine the posterior mean µ̂ = E[θ|X = x] and obtain the posterior standard deviation
(which gives an indication of the precision of the posterior mean).
(8.3) Suppose that we have a single observation x from an exponential distribution with parameter θ.
Consider a Gamma(a, b) distribution as a prior for θ. Derive the posterior distribution, π(θ|x),
and show that this is a conjugate prior.
Write down the posterior mean θ̄ = E[θ|X = x] and the maximum a posteriori (MAP) estimator
θ̂ = arg max π(θ|X = x).
With data x = 4.8, and prior parameters a = 10 and b = 1.5, what is the corresponding
posterior distribution? Find the posterior mean and the MAP estimate. Also calculate the posterior
standard deviation. Comment.
Practical Exercises to be Completed Using R
Tomorrow, you will meet the R statistical programming environment. Once you are familiar with R,
you might like to try out the exercises below.
The following simple exercises will allow you to check some of your early answers, but will also
require use of a range of R functions. As well as performing more complicated statistical analyses, R
is very useful for performing calculations and for plotting graphs. Over the page is a more complicated
example where, although the individual calculations are simple, it would be too time-consuming to
perform by hand.
1. In Exercise (1.1), evaluate the expected value and variance of X.
Hint: Define vectors for x and the probabilities, take element-wise product then sum.
2. In Exercises (1.2), plot a graph of the probability density function.
Hint: Use the curve command.
3. In Exercises (3.1), evaluate the two probabilities that the student passes the exam.
Hint: Use the pbinom command.
4. In Exercises (5.1), evaluate the fitted frequencies using the estimated value of p.
Hint: Use the dbinom command.
5. In Exercises (5.2), calculate the test statistic and the corresponding p-value.
Hint: Use the pt command, or the t-test command.
6. In Exercises (6.1), calculate the three probability values for the standard normal.
Hint: Use the pnorm command.
7. In Exercises (6.3), calculate the two probability values for the normal random variable with
mean 10 and variance 25.
Hint: Use the pnorm command giving the mean and standard deviation.
8. In Exercises (6.4), calculate the exact binomial probability and compare to the previously found
approximation.
Hint: Use the pbinom command.
9. In Exercises (7.1), simulate some data from the continuous uniform distribution, and draw a
histogram. Does this look consistent with the exponential?
Hint: Use the runif and hist commands.
10. In Exercises (7.2), simulate two equal-sized samples, each from a different normal distribution,
and calculate the element-wise sum. Draw a histogram and evaluate the mean and
variance. Are these consistent with the theoretical result?
Hint: Use the rnorm, hist, mean and var commands.
Extended Practical Exercise
Suppose that 100 people are subject to a blood test. However, rather than testing each individual
separately (which would require 100 tests), the people are divided into groups of 5 and the blood
samples of the people in each group are combined and analysed together. If the combined test is
negative, one test will suffice for the whole group; if the test is positive, each of the 5 people in
the group will have to be tested separately, so that overall 6 tests will be made. Assume that the
probability that an individual tests positive is 0.02 for all people, independently of each other.
In general, let N be the total number of individuals, n be the number in each group (with k = N/n
the number of groups), and p the probability that an individual tests positive.
Consider one group of size n, and let Ti represent the number of tests required for the ith group
(i = 1, . . . , k). The combined test is negative, and hence one test will be sufficient, with probability
Pr(Ti = 1) = Pr(combined test is negative) = (1 − p)^n,
otherwise it is positive, and n + 1 tests are required, with probability
Pr(Ti = n + 1) = Pr(combined test is positive) = 1 − (1 − p)^n.
Now the expected number of tests for the ith group is
E[Ti] = 1 × Pr(Ti = 1) + (n + 1) × Pr(Ti = n + 1) = (1 − p)^n + (n + 1)(1 − (1 − p)^n) = (n + 1) − n(1 − p)^n.
The expected total number of tests, E[T], is then given by the sum of the expected numbers for each
group,
E[T] = E[T1] + · · · + E[Tk] = k × ((n + 1) − n(1 − p)^n).
For the given values, N = 100, n = 5, p = 0.02, the expectations are E[Ti] = 6 − 5(0.98)^5 ≈ 1.4804
and therefore E[T] = 20 × 1.4804 ≈ 29.6 ≈ 30. So on average only 30 tests will be required,
instead of 100.
But, for the given total number of people, is this the best choice of n, and what happens as p varies?
Use R to repeat the above calculations, then try other values of n (which lead to integer k) to see if
n = 5 gives the smallest expected total number of tests. Repeat this process for p = 0.01 and p = 0.5,
and comment on the best choice of group size.
and comment on the best choice of group size.
If possible, produce a line graph of expected total number of tests, E[T ], against group size, n, with
separate lines for different values of p. Also, a graph of the optimal choice of n against p. Comment
on these graphs.
Solutions to Practical Exercises
1. > x=c(-3,-1,0,1,2,3,5,8)
   > p=c(0.1,0.2,0.15,0.2,0.1,0.15,0.05,0.05)
   > xp = x*p
   > x2p = x**2*p
   > ex = sum(xp)
   > ex2 = sum(x2p)
   > ex2 - ex**2
2. > curve((3/4)*x*(2-x),0,2)
3. > 1-pbinom(3,20,1/5)
> 1-pbinom(3,20,1/3)
4. > x=c(0,1,2,3,4,5)
   > f=c(8,40,88,110,56,18)
   > p=sum(x*f/320)/5
   > dbinom(0:5,5,p)*320
5. > x=c(0.05,0.29,0.39,-0.18,0.11,0.15,0.35,0.28,-0.17,0.07)
   > xm = mean(x)
   > xsd = sd(x)
   > tobs = (xm-0)/(xsd/sqrt(10))
   > 2*(1-pt(2.11,9))
   > t.test(x)
6. > pnorm(2)
> pnorm(1)-pnorm(-1)
> pnorm(1.96)-pnorm(-1.96)
7. > pnorm(20,10,5)-pnorm(15,10,5)
> pnorm(8,10,5)
8. > 1-pbinom(29,50,.52)
9. > x=runif(100)
   > y=-log(x)
   > hist(y)
   > mean(y)
   > x=runif(1000)
   > y=-log(x)/5
   > hist(y)
   > mean(y)
10. > x=rnorm(1000)
    > hist(x)
    > mean(x)
    > var(x)
    > y=rnorm(1000,10,5)
    > mean(y)
    > var(y)
    > z=x+y
    > hist(z)
    > mean(z)
    > var(z)
    > sd(z)
Extended Exercise
> N=100
> n=5
> p=0.02
> k=N/n
> eT = k*((n+1)-n*(1-p)**n)
>
> ns = 1:20
> eT = rep(0,length(ns))
> p=0.02
> for (i in 1:length(ns)){
+   k=N/ns[i]
+   eT[i] = k*((ns[i]+1)-ns[i]*(1-p)**ns[i])
+ }
> plot(ns,eT)
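The exercise also asks what happens for other values of p and which group size is best; a possible extension of the above script (a sketch, not part of the original solutions, with illustrative p values) loops over several values of p and reports the group size minimising the expected total number of tests.

N <- 100
ns <- c(2, 4, 5, 10, 20, 25, 50)            # group sizes giving integer k = N/n
for (p in c(0.01, 0.02, 0.05, 0.5)) {       # a few illustrative values of p
  eT <- (N / ns) * ((ns + 1) - ns * (1 - p)^ns)
  cat("p =", p, ": best n =", ns[which.min(eT)],
      ", expected tests =", round(min(eT), 2), "\n")
}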
Solutions to Exercises
(1.1) Clearly all probabilities are between 0 and 1, and they sum to 1. Hence they define a valid
probability distribution.
[Note that only checking that the probabilities sum to 1 is not sufficient, as for example both
(3/5, 3/5, −1/5) and (1/4, 5/4, −1/2) sum to 1, but violate other conditions.]
For the first two, add the appropriate probabilities to give 0.45 and 0.3 respectively.
To calculate the means and variances we extend the probability table with two extra rows:
x
pX (x)
xpX (x)
x2 pX (x)
-3
0.1
-0.3
0.90
-1
0.2
-0.2
0.20
0
1
2
3
5
0.15 0.2 0.1 0.15 0.05
0
0.2 0.2 0.45 0.25
0
0.2 0.4 1.35 1.25
8
0.05
0.4
3.2
P
P 2
Summing the last two rows, we obtain E[X] =
xpX (x) = 1 and E[X 2 ] =
x pX (x) =
7.5, and so V ar(X) = E[X 2 ] − (E[X])2 = 7.5 − 12 = 6.5.
(1.2) First note that for fX(x) = cx(2 − x) to be a valid density we require the p.d.f. to be always
non-negative. Clearly, for 0 ≤ x ≤ 2 we require c ≥ 0, and note that for x outside this range
fX(x) = 0 by definition. Also using the fact that ∫_{−∞}^{∞} fX(x) dx = 1:

∫_{−∞}^{∞} fX(x) dx = c ∫_0^2 (2x − x²) dx = c [x² − x³/3]_0^2 = 4c/3,

hence c = 3/4.

Recall the definition of the c.d.f.: FX(x) = P(X ≤ x) = ∫_{−∞}^x fX(y) dy.

[To avoid possible confusion between the variable over which we are integrating and the upper
limit of integration, it is usually safest to re-label one of them.]

Note that if x < 0 then FX(x) = 0, and if x > 2 then FX(x) = 1. If 0 ≤ x ≤ 2 then

FX(x) = ∫_{−∞}^x fX(y) dy = ∫_0^x fX(y) dy = (3/4)(x² − x³/3).

As a result we can write:

FX(x) = 0 for x < 0;   (3/4)(x² − x³/3) for 0 ≤ x ≤ 2;   1 for x > 2.

[Make sure that you define the cumulative distribution function for all real values and include
FX(x) = 0 and FX(x) = 1 in the answer.]

Straight from the c.d.f. we have P(X > 1) = 1 − P(X ≤ 1) = 1 − FX(1) = 1 − (3/4)(1 − 1/3) = 1/2.
(2.1) First find the marginal of Y by summing over the values of x to give:

pY(y) = 4/32 for y = −2;   12/32 for y = −1;   12/32 for y = 0;   4/32 for y = 1.

Then the cumulative distribution function, using FY(y) = Pr(Y ≤ y), gives:

FY(y) = 0 for y < −2;   4/32 for −2 ≤ y < −1;   16/32 for −1 ≤ y < 0;   28/32 for 0 ≤ y < 1;   1 for 1 ≤ y.

[Figure: plots of the p.m.f. f(y) and the step-function c.d.f. F(y).]

[Notice that we must define the cdf for all real numbers even though we are only really interested
in the central part. Marks would be lost for missing these "extreme" values.]

For the conditional distribution, first find the marginal of X by summing over y to give:

x        1       2       3
pX(x)    8/32    16/32   8/32

Then use pY|X(y|x) = p(x, y)/pX(x) to give:

           y = −2   y = −1   y = 0   y = 1
x = 1      1/8      3/8      3/8     1/8
x = 2      1/8      3/8      3/8     1/8
x = 3      1/8      3/8      3/8     1/8
[Here the p.m.f.s are shown as tables – compare to (a) above – either approach is fine. Also
each row of the conditional probabilities table is a probability distribution and so sums to 1.]
Since the conditional distribution of Y given X = x does not depend on x (equivalently, the
conditional distribution is equal to the marginal), then X and Y are independent.
(2.2) First the marginal of X, by integrating over the variable we do not want, i.e. over y:

fX(x) = ∫_0^2 (6/7)(x² + xy/2) dy = (6/7)[x²y + xy²/4]_0^2 = (6/7)(2x² + x),   0 ≤ x ≤ 1.

Similarly, fY(y) = (6/7)(1/3 + y/4),   0 ≤ y ≤ 2.

[Figure: the two marginal densities f(x) and f(y).]

Now the cdfs:

FX(x) = 0 for x < 0;   (6/7)(2x³/3 + x²/2) for 0 ≤ x ≤ 1;   1 for 1 < x.

FY(y) = (6/7)(y/3 + y²/8),   0 ≤ y ≤ 2.

[Figure: the corresponding cumulative distribution functions F(x) and F(y).]

The expectation,

E[X] = ∫_0^1 x (6/7)(2x² + x) dx = (6/7) ∫_0^1 (2x³ + x²) dx = (6/7)[2x⁴/4 + x³/3]_0^1 = (6/7)(1/2 + 1/3) = 5/7.

E[X(X − Y)] = ∫_0^1 ∫_0^2 (x² − xy)(6/7)(x² + xy/2) dy dx = (6/7) ∫_0^1 ∫_0^2 (x⁴ − x³y/2 − x²y²/2) dy dx
            = (6/7) ∫_0^1 [x⁴y − x³y²/4 − x²y³/6]_0^2 dx = (6/7) ∫_0^1 (2x⁴ − x³ − 4x²/3) dx
            = (6/7)[2x⁵/5 − x⁴/4 − 4x³/9]_0^1 = −53/210.

E[X|Y = 1] = ∫_0^1 x fX|Y(x|y) dx = ∫_0^1 x f(x, y)/fY(y) dx,

so we must first evaluate the marginal density of Y,

fY(y) = ∫_0^1 (6/7)(x² + xy/2) dx = (6/7)[x³/3 + x²y/4]_0^1 = (6/7)(1/3 + y/4),

and fY(1) = 1/2, so

E[X|Y = 1] = ∫_0^1 x (6/7)(x² + x/2)/(1/2) dx = (12/7)[x⁴/4 + x³/6]_0^1 = 5/7.
(2.3) Let D be the event that the tested person has the disease and B the event that his/her test result
is positive. Then, according to the information given in the question, we have
Pr(B|D) = 0.8,   Pr(B|Dᶜ) = 0.05,   Pr(D) = 0.004.

Using Bayes formula, we obtain

Pr(D|B) = Pr(D ∩ B)/Pr(B) = Pr(B|D)·Pr(D) / [Pr(B|D)·Pr(D) + Pr(B|Dᶜ)·Pr(Dᶜ)]
        = 0.8 · 0.004 / (0.8 · 0.004 + 0.05 · 0.996) ≈ 0.0604.
Remark. This probability may look surprisingly small. An explanation may be as follows.
Since 0.4% of the population actually have the disease, it follows that, on average, 40 persons
out of every 10,000 will have it. The test will (on average) successfully reveal the disease
in 40 · 0.8 = 32 cases. On the other hand, for 9,960 healthy persons, the test will state that
about 9, 960 · 0.05 ≈ 498 of them are ‘ill’. Therefore, the test appears to be positive in about
32 + 498 = 530 cases, but the fraction of those who actually have the disease is approximately
given by 32/530 ≈ 0.0604.
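A one-line R check of this calculation (a sketch using only the probabilities given in the question):

(0.8 * 0.004) / (0.8 * 0.004 + 0.05 * 0.996)   # approximately 0.0604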
(3.1) The exam results can be modelled by Bernoulli trials with probability of success p = 1/5 in part
(a) and p = 1/3 in part (b). If X is the number of correct answers, then X has the distribution
Bin(n = 20, p) with probabilities

Pr(X = k) = (20 choose k) p^k (1 − p)^{20−k},   k = 0, . . . , 20.

Noting that 20% of 20 is 4, the probability of passing the exam is given by

Pr(X ≥ 4) = 1 − Pr(X < 4) = 1 − Σ_{k=0}^{3} Pr(X = k).

The results are shown in the table:

p      Pr(X = 0)   Pr(X = 1)   Pr(X = 2)   Pr(X = 3)   Pr(X ≥ 4)
1/5    0.0115      0.0576      0.1369      0.2054      0.5886
1/3    0.0003      0.0030      0.0143      0.0429      0.9396

giving the answers: (a) 0.5886, (b) 0.9396.
(3.2) (a) For the exponential the c.d.f. is

Pr(X ≤ x) = FX(x) = 1 − e^{−λx},   x ≥ 0,

and so

Pr(X > x) = 1 − Pr(X ≤ x) = 1 − (1 − e^{−λx}) = e^{−λx}.

Hence, with λ = 2,

Pr(X > 1/2) = e^{−2×1/2} = e^{−1} = 0.3679 (4 s.f.)
(b) We require x such that FX(x) = 1/2 (note that this is the median), that is x such that
1 − e^{−2x} = 1/2, hence 1/2 = e^{−2x} and −log 2 = −2x, so x = (1/2) log 2 = 0.3466 (4 s.f.)

(c) From the definition of conditional probability:

Pr(X > 1 | X > 1/2) = Pr((X > 1) ∩ (X > 1/2)) / Pr(X > 1/2),

but note that

(X > 1) ⊂ (X > 1/2)

and so

(X > 1) ∩ (X > 1/2) = (X > 1).

Hence we require

Pr(X > 1)/Pr(X > 1/2) = e^{−2×1}/e^{−2×1/2} = e^{−2}/e^{−1} = e^{−1} = 0.3679 (4 s.f.)
(3.3) The moment generating function of the Poisson distribution is found as follows.

MX(t) = E[e^{tX}] = Σ_x e^{tx} λ^x e^{−λ}/x! = e^{−λ} Σ_x (λe^t)^x/x! = e^{λ(e^t−1)} Σ_x (λe^t)^x e^{−λe^t}/x! = e^{λ(e^t−1)},

where the sum is of the Po(λe^t) probabilities (check this) and so is equal to 1.

[Note that, as with the derivation of the binomial m.g.f., here we could directly use the series
expansion of the exponential.]

Here differentiating and setting t = 0 gives:

dMX(t)/dt = λe^t e^{λ(e^t−1)},
d²MX(t)/dt² = λe^t e^{λ(e^t−1)} + (λe^t)² e^{λ(e^t−1)},

and so E[X] = λ, E[X²] = λ + λ². Hence Var(X) = λ.

The moment generating function of the Poisson random variable Xi is

MXi(t) = e^{λi(e^t−1)}

and so, if Sk = X1 + · · · + Xk, then the moment generating function of Sk is

MSk(t) = Π_{i=1}^k MXi(t) = Π_{i=1}^k e^{λi(e^t−1)} = e^{(Σλi)(e^t−1)},

which is the moment generating function of a Poisson random variable with parameter Σλi, that
is Sk ∼ Po(Σλi). Hence the mean and variance of Sk are both equal to Σλi.

[In the last two questions, we see the power of the moment generating function. We are producing
important results without too much difficulty.]
(4.1) The coding centres the x-values, meaning that Σxi = 0, and hence x̄ = 0.

The remaining calculations are as follows.

Sugar, y   Coded temp., x   xi²     xi yi
8.1        -0.5             0.25    -4.05
7.8        -0.4             0.16    -3.12
8.5        -0.3             0.09    -2.55
9.8        -0.2             0.04    -1.96
9.5        -0.1             0.01    -0.95
8.9         0.0             0.00     0.00
8.6         0.1             0.01     0.86
10.2        0.2             0.04     2.04
9.3         0.3             0.09     2.79
9.2         0.4             0.16     3.68
10.5        0.5             0.25     5.25
Sum: 100.4  0.0             1.10     1.99

[Figure: scatterplot of sugar remaining against coded temperature, with fitted equation.]

Since ȳ = 9.13 and Sxy/Sxx = Σxi yi / Σxi² = 1.99/1.10 = 1.81 (to 2 dp) we get the regression equation
sugar = 9.13 + 1.81 × coded temp
where “sugar” is the sugar remaining after fermentation and “coded temp” is the fermentation
temperature minus 21.5 degrees centigrade.
Alternatively, we could give the regression equation as
sugar = −29.77 + 1.81 × temp
where “temp” is in degrees centigrade. Either form is correct, but you need to be clear as to
which form you have used.
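As a quick cross-check (not part of the original solutions), the same fit can be obtained in R with lm(), using the coded temperatures and the sugar values from the table above.

temp  <- seq(-0.5, 0.5, by = 0.1)   # coded temperatures
sugar <- c(8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5)
fit <- lm(sugar ~ temp)
coef(fit)                            # intercept about 9.13, slope about 1.81
plot(temp, sugar); abline(fit)       # scatterplot with the fitted line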
(4.2) Re-arranging the model gives us εi = yi − α − β1 xi − β2 wi (i = 1, . . . , n). Hence the sum of
squared errors is

S = Σi εi² = Σi (yi − α − β1 xi − β2 wi)².

Differentiating S w.r.t. α, we get

∂S/∂α = −2 Σi (yi − α − β1 xi − β2 wi) = −2(nȳ − nα),

since Σi xi = Σi wi = 0 as these variables are already centred. Setting this differential to zero
when α = α̂, we get α̂ = ȳ (as in the one-predictor case).
To find β̂1, we set ∂S/∂β1 to zero when β1 = β̂1:

0 = Σi xi (yi − α − β̂1 xi − β2 wi) = Sxy − α Σi xi − β̂1 Sxx − β2 Swx
⇒ Sxy = β̂1 Sxx + β2 Swx
⇒ β̂1 = (Sxy − β2 Swx)/Sxx.

This leaves us with one equation in two unknowns. To get round this, we substitute β̂2 for β2.
Hence we need to find β̂2 by differentiating S w.r.t. β2, to get β̂2 = (Swy − β̂1 Swx)/Sww. With
this substitution, we get

Sxx β̂1 = Sxy − Swx(Swy − β̂1 Swx)/Sww = Sxy − Swx Swy/Sww + β̂1 Swx²/Sww
⇒ β̂1 = (Sxx − Swx²/Sww)^{−1} (Sxy − Swx Swy/Sww)
      = (1 − Swx²/(Sww Sxx))^{−1} (Sxy/Sxx − Swx Swy/(Sww Sxx)),

as required.
(5.1) Each child is male or female with some fixed (but unknown) probability, and we are considering
families of five children so a suitable model is Binomial, X ∼ B(m = 5, p). We need also to
assume independence of children within a family.
To estimate the probability, p, use x̄/m = 0.5375. The corresponding fitted frequencies are:
6.8, 39.4, 91.5, 106.3, 61.8, 14.4; which are pretty close to the observed frequencies.
(5.2) Let X be the daily log exchange rate, and we are told that X ∼ N(µ, σ²) is an acceptable model.
To estimate the unknown parameters we use µ̂ = x̄ = 0.134 and σ̂ = s_{n−1} = 0.2002. To test
the given hypothesis we use the t-test (as the population variance is unknown) with test statistic
tobs = (x̄ − µ0)/(s/√n) = (0.134 − 0)/(0.2002/√10) = 2.1. For a 5% test, the critical value
is tcrit such that Pr(T_{n−1} > tcrit) = 0.025, where T follows a t-distribution with n − 1 = 9
degrees of freedom. From the tables, Pr(T9 > 2.262) = 0.025. In our case tobs is not greater
than tcrit and hence there is not sufficient evidence to reject the null hypothesis. (From R the
p-value is 0.06342, hence the same conclusion.)
(6.1) From the statistical tables: (a) 0.9772, (b) Pr(−1 < Z < 1) = Pr(Z < 1) − Pr(Z < −1) =
Pr(Z < 1) − (1 − Pr(Z < 1)) = 2 × Pr(Z < 1) − 1 = 2 × 0.8413 − 1 = 0.6826, and
(c) Pr(Z² < 3.8416) ≡ Pr(−1.96 < Z < 1.96) = 2 × Pr(Z < 1.96) − 1 ≈ 2 × Pr(Z < 1.95) − 1
= 2 × 0.9744 − 1 = 0.9488. (Note that retaining 1.96 the answer is 0.95.)
(6.2) With the given MGF, MX(t) = exp{µt + σ²t²/2}, and using Result 4 in Section 3.2 with
a = 1/σ and b = −µ/σ, then the MGF of X* is

MX*(t) = exp{−µt/σ} MX(t/σ) = exp{−µt/σ} × exp{µt/σ + σ²(t/σ)²/2}
       = exp{−µt/σ + µt/σ + σ²t²/(2σ²)} = exp{t²/2},

which is of the same form as the original MGF but with mean zero and unit variance, hence
X* ∼ N(0, 1) by the uniqueness of the MGF.
(6.3) Let X ∼ N(µ = 10, σ² = 25) and Z ∼ N(0, 1).

To evaluate the probabilities stated in (a) and (b), we must first standardise X so we can refer
to the standard normal table. From the lecture notes, we know that X = σZ + µ. So:

Pr(X ≤ x) = Pr(σZ + µ ≤ x) = Pr(Z ≤ (x − µ)/σ) = Φ((x − µ)/σ),

where Φ(z) = FZ(z) = Pr(Z ≤ z).

(a) To evaluate Pr(X ≤ 8), we first must standardise:

Pr(X ≤ 8) = Pr(Z ≤ (8 − µ)/σ) = Φ((8 − 10)/5) = Φ(−0.4).

As Φ(−z) = 1 − Φ(z),

Φ(−0.4) = 1 − Φ(0.4) = 1 − 0.6554 = 0.3446.

(b) We can rewrite Pr(15 ≤ X ≤ 20) in terms of the following cumulative probabilities:

Pr(15 ≤ X ≤ 20) = Pr(X ≤ 20) − Pr(X ≤ 15).

Note: when considering continuous random variables, Pr(X ≤ x) = Pr(X < x).

The next step is to standardise:

Pr(X ≤ 20) = Pr(Z ≤ (20 − µ)/σ) = Φ((20 − 10)/5) = Φ(2).
Pr(X ≤ 15) = Pr(Z ≤ (15 − µ)/σ) = Φ((15 − 10)/5) = Φ(1).

From the normal tables, we find that Φ(2) = 0.9772 and Φ(1) = 0.8413. Hence:

Pr(15 ≤ X ≤ 20) = 0.9772 − 0.8413 = 0.1359.
(6.4) We are told that X ∼ Bin(n = 50, p = 0.52) and the normal approximation of this distribution
is N(µ = np, σ² = np(1 − p)), so X is approximated by Y where Y ∼ N(µ = 26, σ² = 12.48).

As we are approximating a discrete distribution with a continuous distribution, we must apply
the continuity correction, so that Pr(X ≥ 30) = Pr(Y > 29.5).

As in the previous question, we must standardise so we can use the normal tables. Again let
Z ∼ N(0, 1), consider the symmetry property Φ(−z) = 1 − Φ(z), and note that Pr(Y ≤ y) =
Pr(Y < y) when we consider continuous distributions. Then

Pr(Y ≤ 29.5) = Pr(Z ≤ (29.5 − µ)/σ) = Φ((29.5 − 26)/√12.48) = Φ(0.9907 . . .).

Using interpolation (as described beside the normal table):

Φ(0.9907 . . .) ≈ 0.8289 + ((0.9907 . . . − 0.95)/(1 − 0.95)) × (0.8413 − 0.8289) = 0.8390.

Therefore Pr(Y > 29.5) = 1 − 0.8390 = 0.161.
(7.1) Here X ∼ U(0, 1), with y = −log(x)/λ (a monotonic transformation), so x = e^{−λy},
|dx/dy| = |−λe^{−λy}| = λe^{−λy}, hence fY(y) = λe^{−λy}, y ≥ 0. Therefore Y ∼ exp(λ), that is Y has an
exponential distribution with parameter λ.
(7.2) Start with the moment generating function of the normal random variable Xi,
MXi(t) = exp{µi t + σi²t²/2}. The moment generating function of Sn = X1 + · · · + Xn is then

MSn(t) = Π_{i=1}^n MXi(t) = Π_{i=1}^n exp{µi t + σi²t²/2} = exp{(Σµi)t + (Σσi²)t²/2}.

This is the moment generating function of a normal random variable with mean Σµi and
variance Σσi², hence Sn ∼ N(Σµi, Σσi²).

If now the random variables have equal mean and variance then this result becomes
Sn ∼ N(nµ, nσ²). Then, again using Result 4 in Section 3.2 with a = 1/n and b = 0, we have

MX̄(t) = exp{0 × t} × MSn(t/n) = exp{nµ(t/n) + nσ²(t/n)²/2} = exp{µt + (σ²/n)t²/2},

which is the MGF of a normal random variable with mean µ and variance σ²/n, hence X̄ ∼
N(µ, σ²/n).
(8.1) With a single observation x from a Poisson distribution, the likelihood is l(θ) = f(x|θ) = θ^x e^{−θ}/x!, and the
prior distribution of θ is Gamma(a, b), π(θ) = b^a θ^{a−1} e^{−bθ}/Γ(a), θ > 0.

Therefore, the posterior distribution of θ|x is

π(θ|x) = f(x|θ)π(θ)/f(x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ.

Substituting gives

π(θ|x) = [b^a θ^{a−1} e^{−bθ} θ^x e^{−θ}/(Γ(a) x!)] / ∫ [b^a θ^{a−1} e^{−bθ} θ^x e^{−θ}/(Γ(a) x!)] dθ
       = θ^{x+a−1} e^{−(b+1)θ} / ∫ θ^{x+a−1} e^{−(b+1)θ} dθ.

Note that the denominator (and numerator) are almost a Gamma density, only the normalising
constants are missing, with parameters a + x and b + 1. Adding the appropriate constants gives

π(θ|x) = [(b+1)^{x+a} θ^{x+a−1} e^{−(b+1)θ}/Γ(x+a)] / ∫ [(b+1)^{x+a} θ^{x+a−1} e^{−(b+1)θ}/Γ(x+a)] dθ.

The integral in the denominator is that of a Gamma pdf over its full range and so has value 1.
Hence,

π(θ|x) = (b+1)^{x+a} θ^{x+a−1} e^{−(b+1)θ} / Γ(x+a),

that is a Gamma(a + x, b + 1) distribution.

As the prior and posterior are both Gamma distributions, we have a conjugate prior here.

For a Gamma(α, λ) the posterior mean is α/λ and the MAP (the posterior mode) is (α − 1)/λ.
Substituting in α = a + x and λ = b + 1 gives: the posterior mean (a + x)/(b + 1) and MAP
(a + x − 1)/(b + 1).

With the values given, the prior is Gamma(2, 0.7) and the posterior Gamma(5, 1.7). These give
estimates 2.9 and 2.4 respectively. The graph shows that the posterior density (dashed line)
is more concentrated than the prior (solid line), and that the mean and mode have increased,
compared to the prior values, due to the higher data value.

[Figure: prior (solid) and posterior (dashed) densities plotted over 0 ≤ θ ≤ 10.]
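The figure and the two estimates can be reproduced with a short R sketch (an illustration consistent with the solution above, not part of the original notes):

a <- 2; b <- 0.7; obs <- 3                                       # prior parameters and the observation
curve(dgamma(x, a, rate = b), from = 0, to = 10, ylab = "density")   # prior (solid)
curve(dgamma(x, a + obs, rate = b + 1), add = TRUE, lty = 2)         # posterior (dashed)
c(mean = (a + obs) / (b + 1), MAP = (a + obs - 1) / (b + 1))         # about 2.9 and 2.4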
(8.2) For this example the prior is Beta with pdf π(θ) = θ^{α−1}(1 − θ)^{β−1}/B(α, β), 0 < θ < 1, and
the data has a Binomial distribution: X|θ ∼ Binomial(n, θ).

Notice that this is almost the same as one of the class examples, and hence here we will take the
approach of only looking at the functional form of the posterior, that is ignoring constants.

So the posterior is:

π(θ|x) ∝ f(x|θ) π(θ) ∝ (n choose x) θ^x (1 − θ)^{n−x} θ^{α−1}(1 − θ)^{β−1} ∝ θ^{x+α−1}(1 − θ)^{n−x+β−1}.

Thus, θ|x ∼ Beta(x + α, n − x + β).

The mean of a Beta(α, β) distribution is α/(α + β) and therefore the posterior mean is

E[θ|x] = (x + α)/(n + α + β).

We are given n = 25, x = 8, and that the prior has mean 1/2 and standard deviation 1/4.
For Y ∼ Beta(α, β),

E[Y] = α/(α + β)   and   Var[Y] = αβ/{(α + β)²(α + β + 1)}.
Therefore,

α/(α + β) = 1/2   and   αβ/{(α + β)²(α + β + 1)} = 1/16.

From the first of these, α = β. Substituting this into the second gives

16α² = 4α²(2α + 1)   ⇒   4 = 2α + 1   (α ≠ 0).

Thus, α = 3/2 = β.

For the above values, the posterior mean is

µ̂ = E[θ|x] = (x + α)/(n + α + β) = (8 + 3/2)/(25 + 3) = 19/56 = 0.3393.

An estimate of the precision of µ̂ is obtained by calculating the standard deviation of θ|x. If
the standard deviation is small, then µ̂ is a precise estimate, whereas if the standard deviation is
large, then µ̂ is not a precise estimate.

Here, applying the Beta variance formula to the posterior Beta(x + α, n − x + β) distribution,

Var[θ|x] = (8 + 3/2)(17 + 3/2)/(28 × 28 × 29) = 0.007730.

Thus, the standard deviation is 0.0879.
(8.3) Here the prior is Gamma, θ ∼ γ(a, b), with pdf

π(θ) = b^a θ^{a−1} e^{−bθ}/Γ(a),   θ, a, b > 0,

and the data has an exponential distribution: X|θ ∼ exp(θ) with pdf

f(x|θ) = θ exp{−θx},   x ≥ 0, θ > 0.

Again we will take the approach of only looking at the functional form of the posterior, that is
ignoring constants.

So the posterior is:

π(θ|x) ∝ f(x|θ) π(θ) ∝ θe^{−θx} b^a θ^{a−1} e^{−bθ}/Γ(a) ∝ θ^{(a+1)−1} e^{−(b+x)θ},

that is

π(θ|x) = (b + x)^{a+1} θ^{(a+1)−1} e^{−(b+x)θ} / Γ(a + 1).

Thus, θ|x ∼ γ(a + 1, b + x).

Recall that for a γ(α, β), the mean is α/β, the mode is (α − 1)/β and the variance is α/β².

Therefore, the posterior mean is θ̄ = (a + 1)/(b + x). With the given numbers this is
(10 + 1)/(1.5 + 4.8) = 1.746, and the MAP estimate is

θ̂ = [(a + 1) − 1]/(b + x) = 10/(1.5 + 4.8) = 1.587.

The posterior standard deviation is √(a + 1)/(b + x) = √11/(1.5 + 4.8) = 0.526,
which is large compared to the estimates, so the estimates are not precise.
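These values are easy to confirm numerically in R; the lines below are a quick sketch using the numbers above, with an equal-tailed 95% credibility interval added as an optional extra check in the spirit of Section 8.3.

a <- 10; b <- 1.5; obs <- 4.8                  # prior parameters and the observation
c(mean = (a + 1) / (b + obs),                  # 1.746
  MAP  = a / (b + obs),                        # 1.587
  sd   = sqrt(a + 1) / (b + obs))              # 0.526
qgamma(c(0.025, 0.975), a + 1, rate = b + obs) # equal-tailed 95% credibility interval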
Standard Distributions
1. A Bernoulli random variable, X, with parameter θ has probability mass function
p(x; θ) = θ^x (1 − θ)^{1−x},   x = 0, 1   (0 < θ < 1),
and mean and variance E[X] = θ and Var[X] = θ(1 − θ).

2. A geometric random variable, X, with parameter θ has probability mass function
p(x; θ) = θ(1 − θ)^{x−1},   x = 1, 2, . . .   (0 < θ < 1),
and mean and variance E[X] = 1/θ and Var[X] = (1 − θ)/θ².

3. A negative binomial random variable, X, with parameters r and θ has probability mass function
p(x; r, θ) = (x−1 choose r−1) θ^r (1 − θ)^{x−r},   x = r, r + 1, . . .   (r > 0 and 0 < θ < 1),
and mean and variance E[X] = r/θ and Var[X] = r(1 − θ)/θ².

4. A binomial random variable, X, with parameters n and θ (where n is a known positive integer)
has probability mass function
p(x; n, θ) = (n choose x) θ^x (1 − θ)^{n−x},   x = 0, 1, . . . , n   (0 < θ < 1),
and mean and variance E[X] = nθ and Var[X] = nθ(1 − θ).

5. A Poisson random variable, X, with parameter θ has probability mass function
p(x; θ) = θ^x e^{−θ}/x!,   x = 0, 1, . . .   (θ > 0),
and mean and variance E[X] = θ and Var[X] = θ.

6. A uniform random variable, X, with parameter θ has probability density function
f(x; θ) = 1/θ,   0 < x < θ   (θ > 0),
and mean and variance E[X] = θ/2 and Var[X] = θ²/12.

7. An exponential random variable, X, with parameter λ has probability density function
f(x; λ) = λe^{−λx},   x > 0   (λ > 0),
and mean and variance E[X] = 1/λ and Var[X] = 1/λ².

8. A normal random variable, X, with parameters µ and σ² has probability density function
f(x; µ, σ²) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)},   −∞ < x < ∞   (−∞ < µ < ∞, σ² > 0),
and mean and variance E[X] = µ and Var[X] = σ².
9. A gamma random variable, X, with parameters α and β has probability density function
f(x; α, β) = β^α x^{α−1} e^{−βx}/Γ(α),   x > 0   (α, β > 0),
where Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx, and mean and variance E[X] = α/β and Var[X] = α/β². Note
that Γ(α + 1) = αΓ(α) for all α and Γ(α + 1) = α! for integers α ≥ 1. Also Γ(1/2) = √π.

10. A beta random variable, X, with parameters α and β has probability density function
f(x; α, β) = x^{α−1}(1 − x)^{β−1}/B(α, β),   0 < x < 1   (α, β > 0),
where B(α, β) = ∫_0^1 x^{α−1}(1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β), and mean and variance E[X] =
α/(α + β) and Var[X] = αβ/{(α + β)²(α + β + 1)}.

11. A Pareto random variable, X, with parameters θ and α has probability density function
f(x; θ, α) = αθ^α/x^{α+1},   x > θ   (θ, α > 0),
and mean and variance E[X] = αθ/(α − 1) (α > 1) and Var[X] = αθ²/{(α − 1)²(α − 2)} (α > 2).

12. A chi-square random variable, X, with degrees of freedom parameter n (n is a positive integer)
has probability density function
f(x; n) = (1/2)^{n/2} x^{n/2 − 1} e^{−x/2}/Γ(n/2),   x > 0,
and mean and variance E[X] = n and Var[X] = 2n.

13. A Student's t random variable, X, with degrees of freedom parameter n (n is a positive
integer) has probability density function
f(x; n) = Γ((n + 1)/2) / { √(nπ) Γ(n/2) (1 + x²/n)^{(n+1)/2} },   −∞ < x < ∞,
and mean and variance E[X] = 0 (n > 1) and Var[X] = n/(n − 2) (n > 2).

14. An F random variable, X, with degrees of freedom parameters m and n (m, n are positive
integers) has probability density function
f(x; m, n) = (m/n)^{m/2} Γ((m + n)/2) x^{m/2 − 1} / { Γ(m/2) Γ(n/2) (1 + mx/n)^{(m+n)/2} },   x > 0,
and mean and variance E[X] = n/(n − 2) (n > 2) and Var[X] = 2n²(m + n − 2)/{m(n − 2)²(n − 4)} (n > 4).
Normal Distribution Function Tables
The first table gives

Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−y²/2} dy,

and this corresponds to the shaded area in the figure to the right. Φ(x) is the probability that a random
variable, normally distributed with zero mean and unit variance, will be less than or equal to x. When
x < 0 use Φ(x) = 1 − Φ(−x), as the normal distribution with mean zero is symmetric about zero.
For interpolation use the formula

Φ(x) ≈ Φ(x1) + ((x − x1)/(x2 − x1)) (Φ(x2) − Φ(x1)),   (x1 < x < x2).

[Figure: standard normal density with the area to the left of x shaded.]
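As a small check of the interpolation formula (a sketch in R, using the tabulated values at 0.95 and 1.00):

x <- 0.9907; x1 <- 0.95; x2 <- 1.00
0.8289 + (x - x1) / (x2 - x1) * (0.8413 - 0.8289)   # interpolated value, about 0.8390
pnorm(x)                                            # exact value for comparison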
Table 1

x     Φ(x)      x     Φ(x)      x     Φ(x)      x     Φ(x)      x     Φ(x)      x     Φ(x)
0.00  0.5000    0.50  0.6915    1.00  0.8413    1.50  0.9332    2.00  0.9772    2.50  0.9938
0.05  0.5199    0.55  0.7088    1.05  0.8531    1.55  0.9394    2.05  0.9798    2.55  0.9946
0.10  0.5398    0.60  0.7257    1.10  0.8643    1.60  0.9452    2.10  0.9821    2.60  0.9953
0.15  0.5596    0.65  0.7422    1.15  0.8749    1.65  0.9505    2.15  0.9842    2.65  0.9960
0.20  0.5793    0.70  0.7580    1.20  0.8849    1.70  0.9554    2.20  0.9861    2.70  0.9965
0.25  0.5987    0.75  0.7734    1.25  0.8944    1.75  0.9599    2.25  0.9878    2.75  0.9970
0.30  0.6179    0.80  0.7881    1.30  0.9032    1.80  0.9641    2.30  0.9893    2.80  0.9974
0.35  0.6368    0.85  0.8023    1.35  0.9115    1.85  0.9678    2.35  0.9906    2.85  0.9978
0.40  0.6554    0.90  0.8159    1.40  0.9192    1.90  0.9713    2.40  0.9918    2.90  0.9981
0.45  0.6736    0.95  0.8289    1.45  0.9265    1.95  0.9744    2.45  0.9929    2.95  0.9984
0.50  0.6915    1.00  0.8413    1.50  0.9332    2.00  0.9772    2.50  0.9938    3.00  0.9987
The inverse function Φ−1 (p) is tabulated below for various values of p.
Table 2

p        0.900    0.950    0.975    0.990    0.995    0.999    0.9995
Φ⁻¹(p)   1.2816   1.6449   1.9600   2.3263   2.5758   3.0902   3.2905