Introducing Probability and Statistics: A concise course on the fundamentals of statistics

by Dr Robert G Aykroyd, Department of Statistics, University of Leeds

© RG Aykroyd and University of Leeds, 2014. Section 4 produced in collaboration with S Barber.

"Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the natural and social sciences to the humanities, government and business. Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. In addition, patterns in the data may be modelled in a way that accounts for randomness and uncertainty in the observations, and are then used to draw inferences about the process or population being studied; this is called inferential statistics. Both descriptive and inferential statistics comprise applied statistics. There is also a discipline called mathematical statistics, which is concerned with the theoretical basis of the subject." (Source: http://en.wikipedia.org/wiki/Statistics)

This short course aims to give a quick reminder of many basic ideas in probability and statistics. The material is selected from an undergraduate module on mathematical statistics, and hence emphasises the "theoretical basis" which underpins applied statistics. The topics covered are a mix of practical methods and mathematical foundations. If you are familiar with most of these ideas, then you are well prepared for your studies. If, on the other hand, you find some of the topics new, then please take some extra time to understand the ideas and complete the exercises.

Outline of the course:

1. BASIC PROBABILITY. Events, sample space and the axioms. Random variables. Expectation and variance.

2. CONDITIONAL PROBABILITY. Conditional probability and independence. Expectation and variance.
Total probability and Bayes theorem.

3. STANDARD DISTRIBUTIONS. Binomial, Poisson, exponential and normal. Moment generating functions. Sampling distributions.

4. LINEAR REGRESSION. The linear regression model. Vector form of regression.

5. CLASSICAL ESTIMATION. Method of moments. Maximum likelihood. Properties of estimators. Hypothesis testing. Likelihood ratio test. Exercises.

6. THE NORMAL DISTRIBUTION. Transformations to normality. Approximations and the central limit theorem.

7. DERIVED DISTRIBUTIONS. Functions of random variables. Sums of independent variables. Student-t, chi-squared and F distributions. Exercises.

8. BAYESIAN ESTIMATION. Subjective probability and expert opinion. Definitions of prior, likelihood and posterior. Posterior estimation. Exercises.

Practical Exercises. Solutions to Practical Exercises. Solutions to Theoretical Exercises. Standard Distributions and Tables.

Useful references:
Rice JA, Mathematical Statistics and Data Analysis, Duxbury Press, 2nd Ed, 1995.
Stirzaker DR, Elementary Probability, CUP, 2003 (online at University library).

1 Basic Probability

1.1 Introduction

Probability is a branch of mathematics which rigorously describes uncertain (or random) systems and processes. It has its roots in the 16th/17th century with the work of Cardano, Fermat and Pascal, but it is also an area of modern development and research. Put simply, probability measures the likelihood or chance of some event occurring: probability zero means the event is impossible, whereas a probability of 1 means that the event is certain. The larger the probability, the more likely the event. Applications include: modelling hereditary disease in genetics, pension calculations in actuarial science, stock pricing in finance, epidemic modelling in public health, and many more!
1.2 Events and axioms

The set of all possible outcomes is the sample space Ω (the Greek capital letter "omega"), and we may be interested in the chance of some particular outcome, or event, occurring. An event, often denoted A, B, C, ..., is a set of outcomes of an experiment. The set can be empty, A = ∅, giving an impossible event, or it can be equal to the sample space, A = Ω, giving a certain event. These extremes are not very interesting, and so the event will usually be a non-empty, proper subset of the sample space. Probabilities must satisfy the following simple rules:

The (Kolmogorov) axioms:
K1: Pr(A) ≥ 0 for any event A,
K2: Pr(Ω) = 1 for any sample space Ω,
K3: Pr(A ∪ B) = Pr(A) + Pr(B) for any mutually exclusive events A and B (that is, when A ∩ B = ∅).

Clearly, these are very basic properties, but they are sufficient to allow many more complex rules to be derived, such as:

The general addition rule: Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).

FURTHER READING: Sections 1.2-1.4 of Stirzaker.

1.3 Random variables

Whenever the outcome of a random experiment is a number, the experiment can be described by a random variable. It is conventional to use capital letters to denote random variables, e.g. X, Y, Z. The range space of a random variable X is the set, S_X, of all possible values of the random variable, e.g. S_X = {a_1, a_2, ..., a_r, ...} or S_X = [0, ∞). A discrete random variable is a random variable with a finite (or countably infinite) range space. A continuous random variable is a random variable with an uncountable range space. For a discrete random variable, X say, the probability of the random variable taking a particular element of the range space is Pr(X = a_r) (or p_X(x)) – this is called the probability mass function.
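The axioms, the addition rule and the idea of a probability mass function can all be checked numerically. The following short Python sketch (not part of the original notes; the fair six-sided die is an illustrative choice) computes event probabilities by counting outcomes:

```python
from fractions import Fraction

# One roll of a fair six-sided die: a finite sample space with
# equally likely outcomes.
omega = {1, 2, 3, 4, 5, 6}

def pr(event):
    """Probability of an event (a subset of omega) by counting outcomes."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # "roll is even"
B = {4, 5, 6}   # "roll is at least 4"

# K1 and K2: probabilities are non-negative and Pr(omega) = 1.
assert all(pr({w}) >= 0 for w in omega)
assert pr(omega) == 1

# The general addition rule: Pr(A u B) = Pr(A) + Pr(B) - Pr(A n B).
assert pr(A | B) == pr(A) + pr(B) - pr(A & B) == Fraction(2, 3)
```

Exact `Fraction` arithmetic is used so the equalities hold exactly rather than up to floating-point error.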
When the random variable, Y say, is continuous we have a function, f_Y(y), to describe the density of probability over the range space – this is called the probability density function. Alternatively, the probabilities may be summarised by a distribution function defined by F_Z(z) = Pr(Z ≤ z). For discrete random variables this is obtained by summing the probability mass function, and for continuous random variables by integrating the probability density function:

F_X(x) = Σ_{r ≤ x} p_X(r)  and  F_Y(y) = ∫_{−∞}^{y} f_Y(t) dt.

As a consequence of this last result, a probability density function can be obtained from the corresponding distribution function by differentiating:

f_Y(y) = d/dy F_Y(y).

FURTHER READING: Sections 2.1, 2.2 and 15.3.2 of Rice and Section 4.1 of Stirzaker. An interesting discussion of randomness can be found at http://en.wikipedia.org/wiki/Randomness

1.4 Expectations and variance

The expectation (or mean) of a random variable is defined as:

E[X] = Σ_x x p(x) for discrete X,  or  E[X] = ∫ x f(x) dx for continuous X,

and the moments (about zero) are defined by:

E[X^r] = Σ_x x^r p(x) for discrete X,  or  E[X^r] = ∫ x^r f(x) dx for continuous X.

The expectation of a function of a random variable is given by:

E[g(X)] = Σ_x g(x) p(x) for discrete X,  or  E[g(X)] = ∫ g(x) f(x) dx for continuous X.

The variance is

Var(X) = E[(X − µ)^2] = Σ_x (x − µ)^2 p(x) (discrete)  or  ∫ (x − µ)^2 f(x) dx (continuous),

where µ = E[X]. It is usually easier, however, to calculate the variance using

Var(X) = E[X^2] − {E[X]}^2,

where E[X^2] is the expectation of X-squared, i.e. the second moment about zero.
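For a discrete random variable these definitions reduce to finite sums, so both forms of the variance can be compared directly. A small Python sketch (not from the original notes; the fair die pmf is again an arbitrary example):

```python
from fractions import Fraction

# pmf of a fair six-sided die: p(x) = 1/6 for x = 1, ..., 6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
assert sum(pmf.values()) == 1          # a valid probability mass function

mean = sum(x * p for x, p in pmf.items())              # E[X]
second_moment = sum(x**2 * p for x, p in pmf.items())  # E[X^2]

# Var(X) from the definition E[(X - mu)^2] ...
var_def = sum((x - mean)**2 * p for x, p in pmf.items())
# ... and from the shortcut E[X^2] - {E[X]}^2.
var_shortcut = second_moment - mean**2

assert mean == Fraction(7, 2)
assert var_def == var_shortcut == Fraction(35, 12)
```

As the text suggests, the shortcut needs only two sums and agrees exactly with the defining formula.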
FURTHER READING: Section 4.3 of Stirzaker, and http://en.wikipedia.org/wiki/Expected_value

2 Conditional Probability

2.1 Definitions

For two discrete random variables, X and Y, we have:

JOINT PROBABILITY MASS FUNCTION: p(x, y) = Pr(X = x, Y = y), where (i) 0 ≤ p(x, y) ≤ 1 for all x, y, and (ii) Σ_x Σ_y p(x, y) = 1.

MARGINAL PROBABILITY MASS FUNCTIONS: p_X(x) = Σ_y p(x, y)  and  p_Y(y) = Σ_x p(x, y).

CONDITIONAL PROBABILITY MASS FUNCTIONS:

p_{X|Y}(x|y) = p(x, y) / p_Y(y) where p_Y(y) > 0,  and  p_{Y|X}(y|x) = p(x, y) / p_X(x) where p_X(x) > 0.

FURTHER READING: Chapter 3 and Section 4.4 of Rice, and Sections 2.1 and 2.2 of Stirzaker.

Continuous case

For two continuous random variables, X and Y, we have:

JOINT PROBABILITY DENSITY FUNCTION: f(x, y), where −∞ < x < ∞, −∞ < y < ∞, (i) f(x, y) ≥ 0 for all x, y, and (ii) ∫_{y=−∞}^{∞} ∫_{x=−∞}^{∞} f(x, y) dx dy = 1.

MARGINAL PROBABILITY DENSITY FUNCTIONS: f_X(x) = ∫_{y=−∞}^{∞} f(x, y) dy  and  f_Y(y) = ∫_{x=−∞}^{∞} f(x, y) dx.

CONDITIONAL PROBABILITY DENSITY FUNCTIONS:

f_{X|Y}(x|y) = f(x, y) / f_Y(y) where f_Y(y) > 0,  and  f_{Y|X}(y|x) = f(x, y) / f_X(x) where f_X(x) > 0.

2.2 Independent random variables

Two random variables X and Y are independent if and only if

p(x, y) = p_X(x) p_Y(y) for all x, y (discrete case),  or  f(x, y) = f_X(x) f_Y(y) for all x, y (continuous case).

FURTHER READING: Sections 5.1-5.3 and 4.4 of Stirzaker.

2.3 Expectations and correlation

Consider random variables X and Y, with joint probability density function f(x, y) or joint probability mass function p(x, y); then for any function h(x, y):

E[h(X, Y)] = Σ_x Σ_y h(x, y) p(x, y) for discrete X, Y,  or  E[h(X, Y)] = ∫_x ∫_y h(x, y) f(x, y) dy dx for continuous X, Y.

For example, the (r, s)th moment about zero, E[X^r Y^s], has h(x, y) = x^r y^s, so that E[XY] uses h(x, y) = xy.
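In the discrete case, marginals, conditionals, E[XY] and the independence check are all small sums over a joint pmf table. A Python sketch (the joint pmf below is made up purely for illustration, not taken from the notes):

```python
from fractions import Fraction as F

# A small made-up joint pmf for (X, Y) on {0, 1} x {0, 1}.
p = {(0, 0): F(1, 5), (0, 1): F(3, 10),
     (1, 0): F(3, 10), (1, 1): F(1, 5)}
assert sum(p.values()) == 1

xs = sorted({x for x, _ in p})
ys = sorted({y for _, y in p})

# Marginal pmfs: sum the joint pmf over the other variable.
p_X = {x: sum(p[(x, y)] for y in ys) for x in xs}
p_Y = {y: sum(p[(x, y)] for x in xs) for y in ys}

# Conditional pmf of X given Y = 1: p(x, 1) / p_Y(1).
p_X_given_1 = {x: p[(x, 1)] / p_Y[1] for x in xs}
assert sum(p_X_given_1.values()) == 1

# E[XY] uses h(x, y) = xy, as in the text.
E_XY = sum(x * y * q for (x, y), q in p.items())
assert E_XY == F(1, 5)

# Independence fails here: p(0, 0) = 1/5 but p_X(0) * p_Y(0) = 1/4.
assert not all(p[(x, y)] == p_X[x] * p_Y[y] for x in xs for y in ys)
```

Note that each marginal and each conditional pmf sums to 1, as the definitions require.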
Further, the (r, s)th moment about the mean, given by E[(X − µ_X)^r (Y − µ_Y)^s], has h(x, y) = (x − µ_X)^r (y − µ_Y)^s, where µ_X = E[X] and µ_Y = E[Y]. Then the correlation of X and Y can be found as:

Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)),

where Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] is the covariance of X and Y. If X and Y are independent, then the covariance is zero and hence the correlation is zero – whenever the correlation is zero, the variables are said to be uncorrelated. Note, however, that in general uncorrelated does not mean the variables are independent.

Given random variables X and Y, the conditional expectation of Y given that X = x is:

E[Y | X = x] = Σ_y y p_{Y|X}(y|x) (discrete)  or  ∫_y y f_{Y|X}(y|x) dy (continuous).

Clearly, in either of these definitions the conditional distribution could be replaced by the ratio of the joint distribution to the appropriate marginal distribution. For any function h(Y), the conditional expectation of h(Y) given X = x is given by

E[h(Y) | X = x] = Σ_y h(y) p_{Y|X}(y|x) (discrete)  or  ∫_y h(y) f_{Y|X}(y|x) dy (continuous).

2.4 Total probability and Bayes theorem

Suppose that we are interested in the probability of some event A, but that it is not easy to evaluate Pr(A) directly. First, let the events B_1, B_2, ..., B_k partition the sample space. For B_1, B_2, ..., B_k to be a partition of the sample space Ω, they must be (i) mutually exclusive, that is B_i ∩ B_j = ∅ (for i ≠ j), and (ii) exhaustive, that is B_1 ∪ B_2 ∪ ... ∪ B_k = Ω. Further suppose that we can easily find Pr(A|B_j) (for j = 1, ..., k); then

Total probability rule: Pr(A) = Σ_{j=1}^{k} Pr(A|B_j) Pr(B_j).

Further, suppose that we have a conditional probability, Pr(A|B) for example, but we are interested in the probability of the events conditioned the other way, that is Pr(B|A); then

Bayes theorem (1): Pr(B|A) = Pr(A|B) Pr(B) / Pr(A), when Pr(A) > 0.
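The total probability rule and Bayes theorem combine naturally in diagnostic-test calculations. The Python sketch below uses hypothetical numbers chosen only for illustration (they are not from the notes or from any real test):

```python
from fractions import Fraction as F

# Hypothetical screening-test numbers (illustration only):
# B = "has condition"; {B, not B} partitions the sample space;
# A = "test is positive".
pr_B = F(1, 100)            # Pr(B): prevalence
pr_A_given_B = F(9, 10)     # Pr(A | B): positive when condition present
pr_A_given_notB = F(1, 10)  # Pr(A | not B): false-positive rate

# Total probability rule over the partition {B, not B}:
pr_A = pr_A_given_B * pr_B + pr_A_given_notB * (1 - pr_B)

# Bayes theorem (1): Pr(B|A) = Pr(A|B) Pr(B) / Pr(A).
pr_B_given_A = pr_A_given_B * pr_B / pr_A

assert pr_A == F(27, 250)
assert pr_B_given_A == F(1, 12)  # a positive test still leaves Pr(B|A) small
```

Even with a fairly accurate test, the low prevalence keeps Pr(B|A) small; this is the same effect explored in Exercise (2.3) below.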
In general, if B_1, B_2, ..., B_k is a partition, as above, and we use the total probability rule, then we can write

Bayes theorem (2): Pr(B_i | A) = Pr(A|B_i) Pr(B_i) / Σ_{j=1}^{k} Pr(A|B_j) Pr(B_j),  i = 1, ..., k.

FURTHER READING: Section 2.1 of Stirzaker.

3 Standard Distributions

3.1 Example distributions

Binomial distribution, B(n, π)
The binomial distribution can be defined as the number of successes in n independent Bernoulli trials, each with two possible outcomes (success and failure) with probabilities π and 1 − π.

p(x) = (n choose x) π^x (1 − π)^{n−x},  x = 0, 1, ..., n  (0 < π < 1).
E[X] = nπ,  Var(X) = nπ(1 − π).

Poisson distribution, Po(λ)
The Poisson distribution is often used as a model for the number of occurrences of rare events in time or space, such as radioactive decays.

p(x) = e^{−λ} λ^x / x!,  x = 0, 1, ...  (λ > 0).
E[X] = λ,  Var(X) = λ.

Exponential distribution, exp(λ)
The exponential distribution is often used to describe the time between events which occur at random, or to model "lifetimes". It possesses the so-called "memoryless" property.

f(x) = λ e^{−λx},  x ≥ 0  (λ > 0).
E[X] = 1/λ,  Var(X) = 1/λ².

Normal distribution, N(µ, σ²)
The normal (or Gaussian) distribution is the most widely used. It is convenient to use, often fits data well and can be theoretically justified (via the central limit theorem).

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),  −∞ < x < ∞.
E[X] = µ,  Var(X) = σ².

FURTHER READING: Sections 4.2, 4.3 and 7.1 of Stirzaker.

3.2 Moment generating functions

The moment generating function (mgf) of a random variable X is defined as M_X(t) = E[e^{tX}], that is Σ_x e^{tx} p_X(x) if X is discrete, or ∫ e^{tx} f_X(x) dx if X is continuous, and it exists provided the sum or integral converges in an interval containing t = 0.

1. The mgf is unique to a probability distribution.

2. By considering the (Taylor) power series expansion

M_X(t) = Σ_{r=0}^{∞} (t^r / r!)
E[X^r], we see that E[X^r] is the coefficient of t^r/r!.

3. Moments can easily be found by differentiation:

E[X^r] = [d^r/dt^r M_X(t)] evaluated at t = 0,

i.e. E[X^r] is the rth derivative of M_X(t) at t = 0.

4. If X has mgf M_X(t) and Y = aX + b, where a and b are constants, then the mgf of Y is M_Y(t) = e^{bt} M_X(at).

5. If X and Y are independent random variables with mgfs M_X(t) and M_Y(t) respectively, then Z = X + Y has mgf given by M_Z(t) = M_X(t) M_Y(t). Extending this to n independent random variables X_i, i = 1, 2, ..., n, with mgfs M_{X_i}(t), the mgf of Z = Σ X_i is M_Z(t) = M_{X_1}(t) M_{X_2}(t) ... M_{X_n}(t). If the X_i, i = 1, 2, ..., n, are independent and identically distributed (i.i.d.) with common mgf M_X(t), then M_Z(t) = {M_X(t)}^n.

6. If {X_n} is a sequence of random variables with mgfs {M_{X_n}(t)}, and X is a random variable with mgf M_X(t) such that lim_{n→∞} M_{X_n}(t) = M_X(t), then the limiting distribution of X_n is the distribution of X.

3.3 Sampling and sampling distributions

The first task of any research project is the design of the investigation. It is important to gather all information regarding the problem from historical records and from experts. This allows each part of the experimental design, modelling and even analysis to be planned. The target population is the set of all people, products or things about which we would like to draw conclusions. Typically we will be interested in some particular characteristic of the population, such as weight, or the risk associated with a particular financial product. The sample is a (usually small) subset of the population, and is selected in such a way as to be representative of the population. We will then use the sample to draw conclusions about the population.
The choice of sample size depends on many factors, such as the sampling method, the natural variability, measurement error, and the required precision of any estimation or the power of any hypothesis tests. Suppose we have a random sample of n observations or measurements, x_1, ..., x_n, of a random variable X. It is very common to summarise the sample using a small number of sample statistics, rather than report the whole sample. The most usual summary statistics are the sample mean and the sample variance:

x̄ = (1/n) Σ x_i  and  s² = (1/(n−1)) Σ (x_i − x̄)².

Other sample summaries are possible, such as the median and mode as measures of the centre or location of the distribution, or the range and inter-quartile range as measures of the spread of the distribution. As well as numerical statistics, it is common to consider graphical representations. Stem-and-leaf and box plots can be used to display the numerical summaries and are particularly useful for comparing general properties between samples. Also, histograms can help to choose, or confirm, a probability model. Numerical statistics are then used to estimate model parameters, for example using the sample proportion to estimate the probability in the binomial, or the sample mean and variance to estimate the population mean and variance in the normal.

If we were to repeat the sampling process to obtain other datasets, then we would not expect the various summary statistics to be unchanged – this is due to sampling variation. We can imagine performing the sampling many times and looking at the distribution of the summary statistic – this is the sampling distribution. Suppose we have a random sample, X_1, ..., X_n, from a normal population with mean µ and variance σ². It can be shown that the sampling distribution of the sample mean also has a normal distribution, with mean µ but with variance σ²/n; that is, X̄ ∼ N(µ, σ²/n). We can also derive results about other distributions.
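The result X̄ ∼ N(µ, σ²/n) can be checked empirically by repeating the sampling many times, as just described. A short simulation sketch (not part of the original notes; µ, σ and n are arbitrary illustrative choices, and numpy is assumed to be available):

```python
import numpy as np

# Repeatedly draw samples of size n from N(mu, sigma^2) and record
# the sample mean of each sample.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 25, 20_000

samples = rng.normal(mu, sigma, size=(reps, n))  # reps samples of size n
xbars = samples.mean(axis=1)                     # one sample mean per sample

# Theory: Xbar ~ N(mu, sigma^2 / n), so the simulated means should centre
# on mu = 10 with variance close to sigma^2 / n = 0.16.
assert abs(xbars.mean() - mu) < 0.05
assert abs(xbars.var() - sigma**2 / n) < 0.01
```

The spread of the simulated means shrinks as n grows, which is exactly the σ²/n behaviour of the sampling distribution.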
For example, a good estimator of the probability, π, in the binomial distribution is π̂ = X̄/n, and for the Poisson, λ̂ = X̄ is a good choice. Notice that each of these is a function of the mean and, although the data are not from a normal distribution, we can call on the central limit theorem, if we have a large sample, to justify a normal approximation. That is, in the binomial case π̂ is approximately N(π, π(1 − π)/n), and for the Poisson, λ̂ approximately follows N(λ, λ/n).

Exercises

(1.1) Let X be a random variable with probability mass function p_X(x) given by

x:       -3    -1    0     1     2     3     5     8
p_X(x):  0.10  0.20  0.15  0.20  0.10  0.15  0.05  0.05

Check that p_X(x) defines a valid probability distribution, then calculate P(1 ≤ X ≤ 4) and P(X is negative). Evaluate the expected value E[X] and the variance Var(X).

(1.2) Suppose that X has a PDF f_X(x) = c x(2 − x) for 0 ≤ x ≤ 2. Find the constant c so that this is a valid PDF. Obtain the cumulative distribution function F_X(x), and then find the probability that X > 1.

(2.1) Let X and Y have the joint probability mass function given in the following table.

           Value of y
x     -2     -1     0      1
1     1/32   3/32   3/32   1/32
2     1/16   3/16   3/16   1/16
3     1/32   3/32   3/32   1/32

Find the cumulative distribution function of Y, and the conditional distribution of Y given X. Are X and Y independent?

(2.2) The joint PDF of X and Y is given by

f(x, y) = (6/7)(x² + xy/2),  0 ≤ x ≤ 1, 0 ≤ y ≤ 2.

Find the marginal PDFs of X and Y, and then the cumulative distribution function of X. Evaluate the expectation of X, the expectation of X(X − Y), and the conditional expectation of X given that Y = 1.

(2.3) A laboratory blood test is 80% effective in detecting a certain disease when it is in fact present. However, the test also yields a 'false positive' result for 5% of the healthy persons tested. Suppose that 0.4% of the population actually have the disease.
What is the probability that a person found 'ill' according to the test does have the disease?

(3.1) An exam paper consists of 20 multiple choice questions with 5 possible answers each (only one is correct). In order to get a pass mark, it is necessary to give correct answers to at least 20% of the questions.

(a) A student has decided to answer just by guessing. What is the probability that he would pass the exam?

(b) Suppose now that the student pursues an "educated guess", in that he knows enough to be able to discard the two most unlikely answers for each of the 20 questions, and will guess at random among the remaining answers. What are his chances of passing the exam now?

(3.2) Suppose that X has an exponential distribution with p.d.f. given by f(x) = λe^{−λx} for x ≥ 0 and f(x) = 0 otherwise, and suppose that λ = 2.

(a) Evaluate the probability Pr(X > 1/2).
(b) Find the value of x such that F_X(x) = 1/2.
(c) Evaluate the probability Pr(X > 1 | X > 1/2).

(3.3) Suppose that X has a Poisson distribution with parameter λ; find the MGF and hence the mean and variance of X. Using MGFs, show that the sum of two independent Poisson random variables is also a Poisson random variable.

4 Linear regression and least squares estimation

4.1 Introduction

In many sampling situations it is essential to consider related variables simultaneously. Even in situations where there is an exact physical law, measured data will be subject to random fluctuations, and hence fitting a functional relationship to data is a common task. There may be some information before an experiment about the type of relationship expected, and this may have been used in the experimental design, but it is always wise to visualize the possible relationship using a scatterplot. The most commonly used model is the straight line.
This may be due to a physical law which is linear, or a local approximation to a nonlinear relationship. In other cases there will be no theoretical justification, and the line is chosen simply because the data seem to follow a linear pattern. In all cases, it is important to check that this assumption is reasonable, both before the analysis by drawing a scatterplot, and afterwards by performing a residual analysis – these checks are not covered in this course.

4.2 The linear regression model

Suppose we have a dataset containing n paired values, {(x_i, y_i) : i = 1, ..., n}. Consider the simple linear regression model

y_i = α + β x_i + ε_i,  i = 1, ..., n,

where y_i is the response or dependent variable, α and β are regression parameters, x_i is the independent or explanatory variable, measured without error, and ε_i is the random error term, assumed independent and identically distributed (iid) N(0, σ²).

Consider a general straight line passing through the data plotted as points on a scatterplot. In general, the points will not lie perfectly on the line; instead there is an error, or residual, associated with each point. Let the straight line be defined by the equation y = α + βx, and denote the y-value on the line when x = x_i by ŷ_i = α + β x_i. Now, since we are assuming that the explanatory variable is measured without error, the residuals are measured only in the y direction. So

r_i = y_i − ŷ_i = y_i − (α + β x_i),  i = 1, ..., n.

Given observations {(x_i, y_i) : i = 1, ..., n}, we estimate the regression parameters by least squares. To do this, we minimise the sum of squared residuals S = Σ_i r_i². Using partial differentiation of S with respect to α and β separately, we obtain

∂S/∂α = −2 Σ_i (y_i − α − β x_i)  and  ∂S/∂β = −2 Σ_i x_i (y_i − α − β x_i).
The minimum can be found by solving these equations, giving parameter estimates α̂ and β̂ when ∂S/∂α = 0 and ∂S/∂β = 0, that is when

Σ_i y_i = n α̂ + β̂ Σ_i x_i,
Σ_i x_i y_i = α̂ Σ_i x_i + β̂ Σ_i x_i²

– these are called the normal equations. Dividing the first by n gives ȳ = α̂ + β̂ x̄; then substituting for α̂ in the fitted equation, ŷ = α̂ + β̂ x, gives ŷ − ȳ = β̂(x − x̄). Notice that when x = x̄ then ŷ = ȳ, that is, the line passes through the centroid, (x̄, ȳ), of the data.

Now, dividing the first normal equation by n and rearranging gives

α̂ = (1/n) Σ_i y_i − β̂ (1/n) Σ_i x_i,

which can be substituted into the second normal equation, and after a few steps leads to

Σ_i x_i y_i = (Σ_i x_i)(Σ_i y_i)/n + β̂ [ Σ_i x_i² − (Σ_i x_i)²/n ].

This gives the result, in two alternative forms,

β̂ = [ Σ_i x_i y_i − (Σ_i x_i)(Σ_i y_i)/n ] / [ Σ_i x_i² − (Σ_i x_i)²/n ] = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)².

Summarising, the least-squares estimators are

α̂ = ȳ − β̂ x̄  and  β̂ = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)².

We can also write β̂ = S_xy / S_xx, where we define

S_xx = Σ_i (x_i − x̄)²  and  S_xy = Σ_i (x_i − x̄)(y_i − ȳ).

To make predictions of the response, ŷ, corresponding to values of the explanatory variable, x, we simply substitute into the fitted equation, ŷ = α̂ + β̂ x. Similarly, fitted values, ŷ_i, corresponding to observed values of the explanatory variable, x_i, can be calculated by substitution as ŷ_i = α̂ + β̂ x_i.

4.3 Vector form of linear regression

Consider the centred linear regression model, where x̄ has been subtracted from all x-values:

y_i = α₀ + β(x_i − x̄) + ε_i,  i = 1, ..., n.

We keep the same notation for the slope parameter but relabel the intercept. Comparing the two versions we see that α₀ = α + β x̄, so α̂₀ = α̂ + β̂ x̄ = (ȳ − β̂ x̄) + β̂ x̄ = ȳ.

We can write our centred regression model in vector form as y = Xθ + ε, where

y = (y_1, y_2, ..., y_n)ᵀ,  θ = (α₀, β)ᵀ,  ε = (ε_1, ε_2, ..., ε_n)ᵀ,

and X is the n × 2 matrix whose ith row is (1, x_i − x̄).
Note that we could write the uncentred model in a similar form – we would just have to change the second column of X. Note also that we could include multiple explanatory variables simply by adding more columns to X and more parameters to θ.

We can estimate θ by least squares. Note that r = y − Xθ, and minimise

S = Σ_i r_i² = rᵀr = (y − Xθ)ᵀ(y − Xθ) = yᵀy − (Xθ)ᵀy − yᵀXθ + (Xθ)ᵀ(Xθ) = yᵀy − 2θᵀXᵀy + θᵀXᵀXθ.

Differentiating S with respect to θ gives

∂S/∂θ = −2Xᵀy + 2XᵀXθ,

and setting this to zero gives the set of equations XᵀXθ̂ = Xᵀy, which defines the least squares estimators

θ̂ = (α̂₀, β̂)ᵀ = (XᵀX)⁻¹Xᵀy.

You should check for yourself that this gives the same parameter estimates as before.

Example: Consider the following data on pullover sales (number of pullovers sold) and price (in EUR) per item. The aim is to discover whether the price (explanatory variable) influences the overall sales (response variable).

Sales: 230, 181, 165, 150, 97, 192, 181, 189, 172, 170
Price: 125, 99, 97, 115, 120, 100, 80, 90, 95, 125

To investigate the relationship between sales (y) and price (x) we can calculate the following values.

Price, x_i   x_i²     Sales, y_i   x_i y_i
125          15625    230          28750
99           9801     181          17919
97           9409     165          16005
115          13225    150          17250
120          14400    97           11640
100          10000    192          19200
80           6400     181          14480
90           8100     189          17010
95           9025     172          16340
125          15625    170          21250
Total 1046   111610   1727         179844

[Fig: Scatterplot of pullover sales against price, with fitted equation.]

First, the means are x̄ = 104.6 and ȳ = 172.7, and then S_xx = 2198.4 and S_xy = −800.2, giving β̂ = S_xy/S_xx = −800.2/2198.4 = −0.364 (to 3 d.p.) and α̂ = ȳ − β̂ x̄ = 172.7 + 0.364 × 104.6 = 210.774. Hence we have the fitted equation Sales = 210.774 − 0.364 × Price.
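The calculation in the example can be reproduced with a few lines of Python (a plain-Python sketch, not part of the original notes), using the summary-statistic form of the least-squares estimators:

```python
# Least-squares fit of the pullover example, following the notes' formulas.
sales = [230, 181, 165, 150, 97, 192, 181, 189, 172, 170]  # response y
price = [125, 99, 97, 115, 120, 100, 80, 90, 95, 125]      # explanatory x

n = len(price)
xbar, ybar = sum(price) / n, sum(sales) / n

# S_xx = sum (x_i - xbar)^2  and  S_xy = sum (x_i - xbar)(y_i - ybar)
Sxx = sum((x - xbar) ** 2 for x in price)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(price, sales))

beta = Sxy / Sxx            # slope estimate
alpha = ybar - beta * xbar  # intercept estimate

assert (xbar, ybar) == (104.6, 172.7)
assert round(Sxx, 1) == 2198.4 and round(Sxy, 1) == -800.2
assert round(beta, 3) == -0.364
assert abs(alpha - 210.774) < 0.005  # matches the text after rounding
```

The small tolerance on α̂ reflects that the notes round β̂ to 3 d.p. before computing the intercept.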
Alternatively, we can fit our regression model by constructing the matrix representation, defining

y = (230, 181, 165, 150, 97, 192, 181, 189, 172, 170)ᵀ

and X as the 10 × 2 matrix with first column all 1s and second column the centred prices

(20.4, −5.6, −7.6, 10.4, 15.4, −4.6, −24.6, −14.6, −9.6, 20.4)ᵀ.

Then we can find α̂₀ and β̂ using

(α̂₀, β̂)ᵀ = (XᵀX)⁻¹Xᵀy = [ [10.0, 0.0], [0.0, 2198.4] ]⁻¹ (1727.0, −800.2)ᵀ = (172.7, −0.364)ᵀ.

Hence α̂ = α̂₀ − β̂ x̄ = 172.7 + 0.364 × 104.6 = 210.774, as before.

5 Classical Estimation and Hypothesis Testing

5.1 Introduction

Statistical inference is the process where we attempt to say something about an unknown probability model based on a set of data which were generated by the model. This inference does not have the status of absolute truth, since there will be (infinitely) many probability models which are consistent with a given set of data. All we can do is to establish that some of these models are plausible, while others are implausible. A common approach is to use a probability model for the data which is completely specified except for the numerical values of a finite number of quantities called parameters. In this chapter we will introduce methods for making inferences about parameters, assuming that the given model is correct. The idea is to use the data, xᵀ = (x_1, x_2, ..., x_n), to make a "good guess" at the numerical value of a parameter, θ.

An ESTIMATE, θ̂ = θ̂(x), is a numeric value which is a function of the data. An ESTIMATOR is a random variable, θ̂(X), which is a function of a random sample Xᵀ = (X_1, X_2, ..., X_n).

5.2 Method of Moments

Assume that the X_i are mutually independent with common p.d.f. f(x; θ_1, θ_2, ..., θ_p). Then the rth population moment (about zero) is E[X^r] = µ'_r(θ_1, θ_2, ..., θ_p), and the rth sample moment is

m'_r = (1/n) Σ_{i=1}^{n} x_i^r.
The method of moments estimate of θ_1, θ_2, ..., θ_p is the solution of the p simultaneous (non-linear) equations

µ'_r(θ_1, θ_2, ..., θ_p) = m'_r,  r = 1, 2, ..., p.

This method of estimation has no general optimality properties and sometimes does very badly, but it usually provides sensible initial guesses for numerical search procedures.

Example: Let X be an exponential random variable with unknown parameter λ, and let x_i, i = 1, ..., n, be a set of independent observations of this variable. The first sample moment is the sample mean x̄, and the first population moment is the expectation of X, i.e. 1/λ. Hence, we find the method of moments estimate of λ by solving 1/λ̂ = x̄, that is, λ̂ = 1/x̄.

FURTHER READING: Sections 8.1 to 8.5 of Rice.

5.3 Maximum likelihood estimation

The joint pdf of a set of data can be written as f(x; θ). Think of this as a function of θ for a particular data set, and define the likelihood function L(θ) = f(x; θ). An obvious guess at θ is the value which maximises the likelihood, that is, the most plausible value given the data. This value is called the maximum likelihood estimate (mle). For technical reasons it is usual to work with the log-likelihood, l(θ) = log L(θ) = log f(x; θ). Further note that if the X_i are mutually independent with common pdf f(·), then

l(θ) = Σ_{i=1}^{n} log f(x_i; θ).

Maximum likelihood estimation enjoys strong optimality properties (at least in large samples).

Example: Again let X be an exponential random variable with unknown parameter λ, and let x_i, i = 1, ..., n, be a set of independent observations of this variable. The log-likelihood is given by

l(λ) = Σ_{i=1}^{n} log f(x_i; λ) = n log(λ) − λ Σ_{i=1}^{n} x_i.

To find the maximum, differentiate with respect to λ, set equal to zero and solve. This produces the estimate λ̂ = 1/x̄. In this case, the m.l.e. is the same as the method of moments estimate.

5.4 Properties
1. The most important property is UNBIASEDNESS. An estimator, θ̂, is unbiased for θ if E[θ̂] = θ. If, for small n, E[θ̂] ≠ θ, but E[θ̂] → θ as n → ∞, then the estimator is ASYMPTOTICALLY UNBIASED.

2. An (unbiased) estimator is CONSISTENT if Var(θ̂) → 0 as n → ∞.

3. If we have two (or more) estimators, θ̂ and θ̃, which are unbiased, then we might choose the one with the smallest variance. The EFFICIENCY of θ̂ relative to θ̃ is defined to be

eff(θ̂, θ̃) = Var(θ̃) / Var(θ̂).

5.5 Hypothesis Testing

Let X_1, ..., X_n be a random sample from a distribution. The general approach to statistical testing is to consider whether the data are consistent with some stated theory or hypothesis. A hypothesis is a statement about the true probability model, though usually this only concerns the parameter within some specified family, for example N(µ, 1). A simple hypothesis specifies a single point value for the parameter, for example µ = µ_0, whereas a composite hypothesis specifies a range, or set, of values, for example µ < µ_0 or µ ≠ µ_0. We usually assume that there are two rival hypotheses:

• Null hypothesis, H_0: µ = µ_0 (usually well-defined and simple),
• Alternative hypothesis, H_1: "some statement" which is a competitor to H_0.

Note: We do not claim to show that H_0 or H_1 is true, but only to assess whether the data provide sufficient evidence to doubt H_0. The null hypothesis usually represents a "benchmark" or a "sceptical stance", for example "this treatment has no effect on the response", and we will only reject it if there is overwhelming evidence against it. We make a decision whether to accept H_0 or to accept H_1 (that is, reject H_0) on the basis of the data. Of course any conclusion is bound to be chancy! There must be a (non-zero) probability of a wrong action, and this is a major characteristic of a statistical test procedure.
The types of error can be summarised as follows:

              H₀ True           H₀ False
  Accept H₀   Correct           Wrong (Type II)
  Reject H₀   Wrong (Type I)    Correct

Then we define α, the significance level,

  α = Pr(Type I Error) = Pr(Reject H₀ when H₀ is true).

This is considered the more important of the two types of error, and of course we want α to be small. Next consider the other error, and define its probability as β,

  β = Pr(Type II Error) = Pr(Accept H₀ when H₀ is false),

which should also be small. Also we define the power function

  φ = 1 − Pr(Type II Error) = 1 − β,

and clearly we want this to be large – a powerful test. Note that this will be a function of the, unknown, true parameter.

5.6 Examples of simple hypothesis tests

Throughout the following examples suppose we have a sample of n observations x₁, x₂, ..., xn from a normally distributed population with mean μ and variance σ².

Example 1: Suppose that we know the population variance, but we do not know the population mean. From the sample we can estimate the population mean using the sample mean, μ̂ = x̄. We might now wish to test the hypothesis that the population mean is some specified value μ₀ against the hypothesis that it is not equal to the specified value. That is, null hypothesis H₀: μ = μ₀ against alternative hypothesis H₁: μ ≠ μ₀. A suitable test statistic is the (observed) z-value

  z_obs = (x̄ − μ₀) / (σ/√n)

which is then compared to the standard normal distribution.

Example 2: Suppose now that the population variance is unknown. For the same hypotheses as above, H₀: μ = μ₀ against H₁: μ ≠ μ₀, the corresponding test statistic is the (observed) t-value

  t_obs = (x̄ − μ₀) / (s_{n−1}/√n)

where s_{n−1} is the sample standard deviation defined by s²_{n−1} = Σ(xᵢ − x̄)²/(n − 1). This is compared to the (so called) t-distribution with n − 1 degrees of freedom.

5.7 The likelihood ratio test

Consider a random sample X₁, . . .
, Xn from some distribution with parameter θ, and suppose that we wish to test H₀: θ = θ₀ against H₁: θ ≠ θ₀. The likelihood ratio statistic is defined as

  Λ = L(θ₀) / L(θ̂)

where θ̂ is the maximum likelihood estimate of θ. Note that 0 ≤ Λ ≤ 1. If there are other unknown parameters, then these are replaced using the appropriate maximum likelihood estimates. We then reject the null hypothesis if Λ is less than some specified value Λ₀, say. This is intuitive since values of Λ close to zero suggest H₁ is true, whereas values close to 1 suggest H₀ is true. It is usual to work with the log likelihood-ratio, λ = log Λ, and we reject H₀ if λ is far below zero. Equivalently, subject to some conditions and when n is large, Wilks' theorem states that W = −2 log Λ is approximately χ²₁ under H₀. We then reject H₀ if W is large, and in particular if it is larger than χ²₁(1 − α) for a test at significance level α.

Exercises

(4.1) A study was made on the amount of converted sugar in a fermentation process at various temperatures. The data were coded (by subtracting 21.5 degrees centigrade from the temperatures) and recorded as follows:

  Sugar remaining after fermentation
  Temp., x:  −0.5  −0.4  −0.3  −0.2  −0.1   0    0.1   0.2   0.3   0.4   0.5
  Sugar, y:   8.1   7.8   8.5   9.8   9.5   8.9   8.6  10.2   9.3   9.2  10.5

Why do you think that the data were coded by subtracting 21.5 from the temperatures? Fit a linear regression model and find the regression equation for your model.

(4.2) Consider the multiple linear regression of a response y on two explanatory variables x and w using the model

  yᵢ = α + β₁xᵢ + β₂wᵢ + εᵢ,  i = 1, ..., n.

Assume that the predictors x and w are already centred, so Σᵢ xᵢ = Σᵢ wᵢ = 0. Use the method of least squares to find α̂ and show that β̂₁ is given by

  β̂₁ = (1 − S²wx/(Sww Sxx))⁻¹ (Sxy/Sxx − Swx Swy/(Sww Sxx))

where Sxy and Sxx are as defined in the notes and

  Sww = Σᵢ (wᵢ − w̄)²,  Swx = Σᵢ (wᵢ − w̄)(xᵢ − x̄),  Swy = Σᵢ (wᵢ − w̄)(yᵢ − ȳ).
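As a check on Exercise (4.1), the simple least-squares fit can be computed directly from Sxy and Sxx. This is a sketch in Python (the practical exercises themselves use R); the data are the coded temperatures and sugar values above.

```python
# coded temperature (x) and sugar (y) data from Exercise (4.1)
x = [-0.5, -0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
y = [8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5]

n = len(x)
xbar = sum(x) / n   # zero here, because the coded temperatures are symmetric
ybar = sum(y) / n

# Sxy = sum (xi - xbar)(yi - ybar),  Sxx = sum (xi - xbar)^2
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

slope = sxy / sxx              # least-squares slope, about 1.81
intercept = ybar - slope * xbar  # equals ybar here since xbar = 0
```

One benefit of the coding is visible immediately: with x̄ = 0 the intercept is simply ȳ.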
(5.1) In a survey of 320 families with 5 children the number of girls occurred with the following frequencies:

  Number of girls:  0   1   2    3   4   5
  Frequency:        8  40  88  110  56  18

Explain why the binomial distribution might be a suitable model for these data, clearly stating any assumptions. Derive the equation for the maximum likelihood estimator of p, the probability of a girl, and then estimate its value using the data.

(5.2) To study a particular currency exchange rate it is possible to model the daily change in log exchange rate by a normal distribution. Suppose the following is a random sample of 10 such values:

  0.05  0.29  0.39  −0.18  0.11  0.15  0.35  0.28  −0.17  0.07

Use these data to perform a 5% hypothesis test that the mean change is equal to zero.

6 The Normal Distribution

6.1 Introduction

Many statistical methods of constructing confidence intervals and hypothesis tests are based on an assumption of normality. The assumption of normality often leads to procedures which are simple, mathematically tractable, and powerful compared to corresponding approaches which do not make the normality assumption. When dealing with large samples, results such as the central limit theorem give us confidence that small departures are unlikely to be important. With small samples, or when there are substantial violations of a normality assumption, the chances of misinterpreting the data and drawing incorrect conclusions increase seriously. Because of this we must carefully consider all assumptions throughout data analysis. Once data have been collected it is important to check modelling assumptions. There are several ways to tell whether a dataset is substantially non-normal, such as calculation and testing of skew and kurtosis, and examination of histograms and probability plots. Histograms “approximate” the true probability distribution but can be greatly affected by the choice of histogram bins etc.
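The moment-based checks mentioned above are easy to compute directly. As a sketch in Python (the course's practicals use R): for approximately normal data the third and fourth standardised sample moments, i.e. the sample skew and kurtosis, should be near 0 and 3 respectively. The sample size and seed are arbitrary choices.

```python
import random

def skew_kurtosis(data):
    """Return the sample coefficients of skew and kurtosis,
    i.e. the third and fourth standardised sample moments."""
    n = len(data)
    m = sum(data) / n
    s = (sum((v - m) ** 2 for v in data) / n) ** 0.5
    skew = sum(((v - m) / s) ** 3 for v in data) / n
    kurt = sum(((v - m) / s) ** 4 for v in data) / n
    return skew, kurt

random.seed(2)
normal_sample = [random.gauss(0, 1) for _ in range(20000)]
skew, kurt = skew_kurtosis(normal_sample)  # expect roughly 0 and 3
```

Large departures of the sample skew from 0, or of the sample kurtosis from 3, indicate non-normality; formal tests based on these quantities also exist.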
Another approach is to consider the probability plot (or quantile-quantile plot), where the expected, or theoretical, quantiles are plotted against the sample quantiles. If the model is a good fit to the data, then the points should form a straight line – departures from the line indicate departures from the model.

6.2 Definitions

Suppose that random variable X follows a normal distribution with mean μ and variance σ²; then it has probability density function

  f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)),  −∞ < x < ∞,

and we might use the shorthand notation X ∼ N(μ, σ²). The cumulative distribution function (CDF) cannot be evaluated as an explicit equation but must be evaluated numerically. The normal density is unimodal (with mode at its mean) and symmetric about its mean. Hence its mean, median and mode are all equal. We can say that E[X] = μ and Var(X) = σ², but also that the coefficient of skew, E[(X − μ)³/σ³] = 0, that is, it is symmetric, and that the coefficient of kurtosis E[(X − μ)⁴/σ⁴] = 3 (so the excess kurtosis is defined to be zero). If a normal random variable Z has mean equal to zero and variance equal to one, then we have the standard normal distribution, Z ∼ N(0, 1). The PDF is sometimes given the notation f(z) = φ(z) and the CDF F(z) = Φ(z). Note that if X ∼ N(μ, σ²) then (X − μ)/σ ∼ N(0, 1).

6.3 Transformations to normality

Many data sets are in fact not approximately normal. However, an appropriate transformation of a data set can often yield a data set that does follow approximately a normal distribution. This increases the applicability and usefulness of statistical techniques based on the normality assumption. The Box-Cox transformation is a particularly useful family of transformations, defined as

  T(x; λ) = (x^λ − 1)/λ  for λ ≠ 0,
  T(x; λ) = log(x)       for λ = 0,

where λ is a transformation parameter.
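The Box-Cox family transcribes directly into code. A minimal Python sketch (the course's practicals use R); note that the λ = 0 case is the limit of the general formula as λ → 0:

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a single value x > 0 with parameter lam."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

# lam = 0 is the limiting case of the general formula:
# for x = 2, (2**lam - 1)/lam approaches log(2) as lam shrinks towards 0.
```

For example, box_cox(4, 0.5) gives (√4 − 1)/0.5 = 2, and because T(x; λ) is increasing in x for every λ, the transformed values keep the ordering of the original data.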
There are several important special cases:

(i) SQUARE ROOT TRANSFORMATION with λ = 1/2. If necessary, make all values positive by adding a constant before taking the square root.

(ii) LOG TRANSFORMATION with λ = 0. Again, it may be necessary to add a constant to make all values positive before taking logs.

(iii) INVERSE TRANSFORMATION with λ = −1. Notice that simply inverting the values would make small numbers large, and large numbers small – this transformation would reverse the order of the values and great care would be needed in the interpretation. This is not a problem with the Box-Cox transform, as the ordering of the values will be identical to the original data.

Data transformations are valuable tools, offering many benefits, but greater care must be used when interpreting results based on transformed data.

6.4 Approximating distributions

Under certain conditions some probability distributions can be approximated by other distributions. Historically this was important as it gave an easy way to perform probability calculations, but it also helps us to understand relationships between distributions and, later, to understand transformations from one distribution to another.

The POISSON APPROXIMATION TO THE BINOMIAL works well when n is large and p is small. As a rule of thumb we might consider the approximation satisfactory when, say, n ≥ 20 and p ≤ 0.05 (alternatively, when n is large and the expected number of “successes” is small, that is np ≤ 10 say). Another way to think of this is that the Poisson will work well when we are modelling rare events in a very large population.

The NORMAL APPROXIMATION TO THE BINOMIAL is reasonable when n is large, and p and (1 − p) are NOT too small; say np and n(1 − p) must both be greater than 5. Note that the conditions for the Poisson approximation to the binomial are complementary to those for the normal approximation to the binomial.

Perhaps the most powerful result is the CENTRAL LIMIT THEOREM (CLT).
Suppose we have a random sample, X₁, X₂, ..., Xn, from any distribution with finite mean, E[X], and variance, Var(X); then the CLT says that, as the sample size n tends to infinity, the distribution of the sample mean, X̄, tends to the normal with mean E[X] and variance equal to Var(X)/n.

7 Derived Distributions

7.1 Introduction

Initially it may seem that each probability distribution is unrelated to any other distribution, but in fact many are related. As simple cases, the binomial, geometric and negative binomial are all generated by repeated Bernoulli trials. There are other examples where one random variable can be derived as a transformation of another, or where one random variable is obtained as a sum of others. Perhaps the most widely used transformations involve the normal distribution, such as linear functions, or the sum of squared normal random variables. Less obvious examples are a normal random variable divided by the root of a sum of squared normal random variables, and the ratio of two sums of squared normal random variables. Each of these corresponds to a common example and the answers should be familiar distributions. In the next sections we will see the mathematical techniques needed to derive many of these results.

7.2 Functions of a random variable

For discrete random variables transformations are straightforward. Assuming that the range space and probability mass function of the original random variable are known, the range space for the transformed random variable can easily be deduced, and the probability can then be transferred to the elements of the new range space using an argument of equivalent events. The corresponding treatment of continuous random variables is not so straightforward. We are not simply reallocating probability masses from elements in one range space to elements of another. In this situation, we are dealing with the more subtle concept of density of probability.
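The discrete case can be sketched directly: probability masses are transferred to the new range space by summing over equivalent events. A Python illustration (the course's practicals use R) with an arbitrary, hypothetical pmf and the transformation Y = X²:

```python
# pmf of X on {-2, -1, 0, 1, 2}: an arbitrary illustrative example
pmf_x = {-2: 0.1, -1: 0.2, 0: 0.3, 1: 0.25, 2: 0.15}

# transfer mass to Y = X**2 using equivalent events:
# Pr(Y = y) = sum of Pr(X = x) over all x with x**2 == y
pmf_y = {}
for x, p in pmf_x.items():
    y = x ** 2
    pmf_y[y] = pmf_y.get(y, 0.0) + p

# e.g. Pr(Y = 1) = Pr(X = -1) + Pr(X = 1) = 0.2 + 0.25 = 0.45
```

The range space of Y is {0, 1, 4}, and the total probability is still 1; nothing subtle is involved, in contrast to the continuous case discussed next.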
The simplest approach is to calculate the (cumulative) distribution function of the transformed variable directly, and then differentiate to obtain the density function.

Example Consider an exponential random variable X with parameter λ, X ∼ exp(λ), and let Y = X². Then

  FY(y) = Pr(Y ≤ y) = Pr(X² ≤ y) = Pr(X ≤ √y) = FX(√y) = 1 − e^{−λ√y}.

Now differentiate to give the density function

  fY(y) = d/dy FY(y) = d/dy (1 − e^{−λ√y}) = (λ/(2√y)) e^{−λ√y}

and the range space of Y is SY = [0, ∞). Note that this density function is unbounded at the origin, unlike the original density function. Although, normally, y = x² is not regarded as a one-to-one function, over the range space SX it behaves as one-to-one.

FURTHER READING: Sections 2.3, 3.6 and 4.5 of Rice.

Result Let X be a continuous random variable with p.d.f. fX(). Suppose that g() is a strictly monotonic function; then the random variable Y = g(X) has p.d.f. fY given by

  fY(y) = fX(g⁻¹(y)) |d/dy g⁻¹(y)|   if y = g(x) for some x,
  fY(y) = 0                          if y ≠ g(x) for any x.

If y = g(x) is not monotonic over the range of X, we split the range into parts on which the function is monotonic (so that a one-to-one relation holds):

  fY(y) = Σ fX(g⁻¹(y)) |d/dy g⁻¹(y)|,   y = g(x),

where the sum is over the separate parts of the range of X for which x and y are in one-to-one correspondence.

Example Consider a random variable X with parameter λ which has the following p.d.f.:

  fX(x) = (λ/2) exp(−λ|x|),  −∞ < x < ∞.

This density function looks like two exponential functions placed back-to-back, and hence is often referred to as the double exponential or, less descriptively, as the Laplace distribution. Consider the transformation Y = X²; clearly, with this range space for X, the transformation is not one-to-one. However, by dividing the range into the two parts −∞ < x < 0 and 0 < x < ∞, y = x² is monotonic over each half separately.
For (−∞, 0), X = −√Y, dx/dy = −½ y^{−1/2} and fX(x) = (λ/2) exp(λx), hence this part contributes fY(y) = (λ/(4√y)) exp(−λ√y).

For (0, ∞), X = √Y, dx/dy = ½ y^{−1/2} and fX(x) = (λ/2) exp(−λx), hence this part contributes fY(y) = (λ/(4√y)) exp(−λ√y).

Summing the parts from the two ranges gives

  fY(y) = (λ/(2√y)) exp(−λ√y),  y ≥ 0,

the same distribution as in the earlier example involving the exponential distribution and the transformation Y = X².

7.3 Transforming bivariate random variables

Suppose we wish to find the joint probability density function of a pair of random variables, Y₁ and Y₂, which are given functions of two other random variables, X₁ and X₂. Further, suppose that Y₁ = g₁(X₁, X₂) and Y₂ = g₂(X₁, X₂), and that the joint probability density function of X₁ and X₂ is fX₁,X₂(x₁, x₂). We assume the following conditions:

(I) The transformation (x₁, x₂) ↦ (y₁, y₂) is one-to-one. That is, we can solve the simultaneous equations y₁ = g₁(x₁, x₂) and y₂ = g₂(x₁, x₂) for x₁ and x₂ to give x₁ = h₁(y₁, y₂) and x₂ = h₂(y₁, y₂) (say). Transformations which are not one-to-one can be handled, but are more complicated except in special cases – such as for sums of independent random variables.

(II) The functions h₁ and h₂ have continuous partial derivatives and the Jacobian determinant is everywhere finite (that is, |J| < ∞), where

  J = det [ ∂x₁/∂y₁  ∂x₁/∂y₂ ]
          [ ∂x₂/∂y₁  ∂x₂/∂y₂ ]

Note that there are other ways to write this; all are equivalent. Then

  fY₁,Y₂(y₁, y₂) = |J| fX₁,X₂(x₁, x₂),

substituting for x₁ = h₁(y₁, y₂) and x₂ = h₂(y₁, y₂) where necessary. The range space for (y₁, y₂) is obtained by applying the inverse transformation to the constraints on x₁ and x₂.

Example If X₁ and X₂ are independent exponential random variables each with parameter λ, then

  fX₁,X₂(x₁, x₂) = fX₁(x₁) fX₂(x₂) = λ² e^{−λ(x₁+x₂)},  x₁, x₂ ≥ 0.
Now, if Y₁ = X₁ + X₂ and Y₂ = e^{X₁}, then x₁ = h₁(y₁, y₂) = log(y₂) and x₂ = h₂(y₁, y₂) = y₁ − log(y₂). The Jacobian matrix is

  [ ∂x₁/∂y₁  ∂x₁/∂y₂ ]   [ 0    1/y₂ ]
  [ ∂x₂/∂y₁  ∂x₂/∂y₂ ] = [ 1   −1/y₂ ]

and so the absolute value of its determinant is |J| = 1/y₂ (this is finite because it can also be shown that y₂ ≥ 1). Then

  fY₁,Y₂(y₁, y₂) = (1/y₂) λ² e^{−λy₁},  y₁ ≥ log y₂, y₂ ≥ 1.

7.4 Sums of independent random variables

Some results

1. If X₁, ..., Xn are independent Poisson random variables with parameters λ₁, ..., λn, then X₁ + ... + Xn also has a Poisson distribution, with parameter (λ₁ + ... + λn).

2. If X₁, ..., Xk are independent binomial random variables with parameters (n₁, p), ..., (nk, p), then X₁ + ... + Xk also has a binomial distribution, with parameters (n₁ + ... + nk, p).

3. If X₁, ..., Xn are independent gamma random variables with parameters (t₁, λ), ..., (tn, λ), then X₁ + ... + Xn also has a gamma distribution, with parameters (t₁ + ... + tn, λ).

4. If X₁, ..., Xn are independent normal random variables with parameters (μ₁, σ₁²), ..., (μn, σn²), then X₁ + ... + Xn also has a normal distribution, with parameters (μ₁ + ... + μn, σ₁² + ... + σn²).

Direct method If X and Y are independent random variables then the probability function for Z = X + Y is

  pZ(z) = Σₓ pX(x) pY(z − x) = Σ_y pX(z − y) pY(y)          if discrete,
  fZ(z) = ∫ fX(x) fY(z − x) dx = ∫ fX(z − y) fY(y) dy       if continuous.

Using generating functions The above results can be derived most easily using moment generating functions (or probability generating functions in the discrete cases), using the result that if Z = X₁ + ... + Xn and the Xᵢ are independent then MZ(t) = Π MXᵢ(t). Of course we must be able to recognise the mgf of Z to identify the distribution.
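Result 1 can be checked numerically using the direct (convolution) method above. A sketch in Python (the course's practicals use R), with the arbitrary rates λ₁ = 2 and λ₂ = 3: the convolution of the two Poisson pmfs should agree with the Poisson(λ₁ + λ₂) pmf.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Pr(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

lam1, lam2 = 2.0, 3.0

def conv_pmf(z):
    """Direct method: p_Z(z) = sum over x of p_X(x) * p_Y(z - x)."""
    return sum(poisson_pmf(x, lam1) * poisson_pmf(z - x, lam2)
               for x in range(z + 1))

# compare with the Poisson(lam1 + lam2) pmf at the first few points
diffs = [abs(conv_pmf(z) - poisson_pmf(z, lam1 + lam2)) for z in range(15)]
```

The differences are zero up to floating-point rounding, illustrating that the sum is Poisson with parameter λ₁ + λ₂.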
7.5 Distributions derived from the Normal distribution

Among the most frequently used techniques in statistics are the t-test and the F-test. These are used to compare the means of two or more samples and to make inferences about the population means from which the samples were drawn. The test statistic in each case is not an arbitrary function of the data, but is chosen to have useful properties. In particular, the function is chosen so that its distribution is known.

• If X has a standard normal distribution and, independently, Y has a chi-squared distribution with ν degrees of freedom, then X/√(Y/ν) has a t-distribution with ν degrees of freedom.

• If X₁ and X₂ have independent chi-squared distributions with ν₁ and ν₂ degrees of freedom, then (X₁/ν₁)/(X₂/ν₂) has an F-distribution with degrees of freedom ν₁ and ν₂.

Preliminary results: distribution of the mean and variance Consider a random sample X₁, ..., Xn from a normal population with mean μ and variance σ², that is Xᵢ ∼ N(μ, σ²), i = 1, ..., n. If we define X̄ = (1/n) Σ Xᵢ and S² = (1/(n−1)) Σ (Xᵢ − X̄)², then

(a) X̄ ∼ N(μ, σ²/n),
(b) (n − 1)S²/σ² ∼ χ²_{n−1}, and
(c) X̄ and S² are independent.

The t-distribution Suppose we have a random sample X₁, ..., Xn from a normal population, Xᵢ ∼ N(μ, σ²), i = 1, ..., n, with sample mean X̄ and variance S². If σ² is known, then (X̄ − μ)/(σ/√n) ∼ N(0, 1), whereas, if we estimate σ² by S², then (X̄ − μ)/(S/√n) ∼ t_{n−1}, that is a t-distribution with n − 1 degrees of freedom.

The F-distribution Suppose we have two independent random samples of sizes n₁ and n₂ from normal populations N(μ₁, σ₁²) and N(μ₂, σ₂²), with sample means and variances X̄₁, S₁² and X̄₂, S₂². Imagine we want to test H₀: σ₁² = σ₂². If H₀ is true then S₁²/S₂² ≈ 1; if it is false, either S₁²/S₂² is large (σ₁² > σ₂²) or S₁²/S₂² is close to zero (σ₁² < σ₂²).
Now X₁ = (n₁ − 1)S₁²/σ₁² ∼ χ²_{n₁−1} and X₂ = (n₂ − 1)S₂²/σ₂² ∼ χ²_{n₂−1}, thus

  F = (X₁/(n₁ − 1)) / (X₂/(n₂ − 1)) = (S₁²/σ₁²) / (S₂²/σ₂²) = S₁²/S₂²  under H₀,

and so F ∼ F_{n₁−1,n₂−1}, an F-distribution with degrees of freedom (n₁ − 1) and (n₂ − 1).

FURTHER READING: Sections 3.6 and 4.5 of Rice.

Exercises

(6.1) Let Z be a standard normally distributed random variable; then find: (a) Pr(Z < 2), (b) Pr(−1 < Z < 1) and (c) Pr(Z² > 3.8416). Hint: In (c) find the probability of an equivalent event involving Z and not Z².

(6.2) Suppose that X is a normally distributed random variable with mean μ and standard deviation σ > 0; then it has MGF given by

  MX(t) = exp(μt + ½σ²t²).

Find the MGF of X* = (X − μ)/σ and hence state the distribution of X*.

(6.3) Let X be a normally distributed random variable with mean 10 and variance 25.
(a) Evaluate the probability Pr(X ≤ 8).
(b) Evaluate the probability Pr(15 ≤ X ≤ 20).

(6.4) Let X follow a binomial distribution with parameters n = 50 and p = 0.52. What are the expectation and variance of X? Hence write down the normal distribution which approximates this binomial distribution. Is this likely to be a good approximation? Use the normal approximation, with a continuity correction, to evaluate the probability that X is at least 30. Why is the continuity correction needed?

(7.1) If X is a continuous random variable with a uniform distribution on the interval [0, 1], that is with PDF fX(x) = 1 for 0 < x < 1, then find the PDF of Y = −log(X)/λ where λ > 0. Name the distribution.

(7.2) Suppose X₁, ..., Xn are independent normal random variables, with corresponding parameters (μ₁, σ₁²), ..., (μn, σn²); then, using MGFs, show that Sk = X₁ + ... + Xn also has a normal distribution. What are the parameters of this new distribution? Suppose now that the random variables are also identically distributed, that is with common mean μ and variance σ².
What can be said about the distribution of X̄ = (1/n) Sk?

8 Bayesian Methods

8.1 Introduction

The Bayesian approach to statistics is currently very fashionable and respectable, but this has not always been the case! Until, perhaps, 20 years ago Bayesian statisticians were seen as extremist and fanatical. Leading statisticians of the day considered their work unimportant and even “dangerous”. The main reason for this lack of trust is the subjective nature of some of the modelling. The key difference, compared to classical statistics, is the use of subjective knowledge in addition to the usual information from data.

Suppose we are interested in a parameter θ. In the standard setting, we would perform an experiment and use the data to estimate θ. But in practice we might have some knowledge about θ before doing the experiment and want to incorporate this prior degree of belief about θ into the estimation process. Let π(θ) be our prior density function for θ, quantifying our prior degree of belief. From the data we can calculate the likelihood, L(x|θ). These two sources of information can be combined to give π(θ|x), the posterior distribution of θ, reflecting our belief about θ after the experiment. Recall Bayes' theorem defined in terms of probabilities of events (A and B say),

  P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B).

The appropriate form of this for our situation is

  π(θ|x) = L(x|θ)π(θ)/p(x).

Note, however, that the divisor is unimportant when making inference about θ and so we can simply say

  π(θ|x) ∝ L(x|θ)π(θ),

that is, “posterior pdf is proportional to likelihood times prior pdf”. The Bayesian method gives a way to include extra information in the problem and can make logical interpretation easier. Although the approach is straightforward, there can be serious (algebraic) difficulties deriving the posterior distribution.
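When the algebra is awkward, the proportionality statement can also be applied numerically on a grid of θ values: evaluate likelihood × prior at each grid point and normalise. A sketch in Python (the course's practicals use R) for a binomial likelihood with a uniform prior, using the hypothetical data x = 7 successes in n = 10 trials; in this case the exact posterior is Beta(x + 1, n − x + 1), so the grid posterior mean should be close to (x + 1)/(n + 2).

```python
from math import comb

n, x = 10, 7   # hypothetical data: 7 successes in 10 trials
m = 10001      # number of grid points over [0, 1]
grid = [i / (m - 1) for i in range(m)]

def prior(theta):
    return 1.0  # uniform prior on (0, 1)

def likelihood(theta):
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

# posterior is proportional to likelihood * prior; normalise over the grid
unnorm = [likelihood(t) * prior(t) for t in grid]
total = sum(unnorm)
post = [u / total for u in unnorm]

# posterior mean, which should be about (x + 1)/(n + 2) = 8/12
post_mean = sum(t * p for t, p in zip(grid, post))
```

Because only ratios matter after normalisation, any multiplicative constant (such as the binomial coefficient here) can be dropped without changing the result.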
Also, there are many possible choices for the prior distribution – with the chance that the final conclusion might depend on this subjective choice of prior. One approach to the choice of prior is to use a non-informative prior (such as the uniform in the following example), which does not have an influence on the modelling, or a vague prior, where the influence is mild. To make deriving the posterior distribution easier, and to give a standard approach to the choice of prior, it is common to use a conjugate prior. That is, given the likelihood, the prior is chosen so that the prior and posterior distributions are in the same family.

8.2 Conjugate prior distributions

To be able to progress much further we must first consider two new examples of continuous distributions, the beta distribution and the gamma distribution. These are particularly important as they are conjugate prior distributions for several widely used likelihood models. For example, the beta is the conjugate prior for all the distributions based on the Bernoulli, that is the geometric and binomial. The gamma is the conjugate prior distribution for the Poisson and the exponential. However, for the most widely encountered data model, the normal distribution, it is the normal distribution itself which is the conjugate prior.

Beta distribution, β(p, q)

  f(x) = x^{p−1}(1 − x)^{q−1}/B(p, q),  0 ≤ x ≤ 1;  p, q > 0,

where B(p, q) = ∫₀¹ x^{p−1}(1 − x)^{q−1} dx. Note that B(p, q) = Γ(p)Γ(q)/Γ(p + q), and that E[X] = p/(p + q) and Var[X] = pq/{(p + q)²(p + q + 1)}. As a special case, when p = q = 1, this reduces to the continuous uniform distribution on the interval (0, 1).

Gamma distribution, γ(α, λ)

  f(x) = λ^α x^{α−1} e^{−λx}/Γ(α),  x ≥ 0;  α, λ > 0,

where Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx. Note that Γ(α + 1) = αΓ(α) for all α > 0, hence Γ(α + 1) = α! for integers α ≥ 1, and that Γ(1/2) = √π. Also, E[X] = α/λ and Var[X] = α/λ².
As important special cases we have (a) when α = 1 this reduces to the exponential distribution with parameter λ, and (b) when α = ν/2 and λ = 1/2 it becomes the chi-squared distribution with ν degrees of freedom, χ²_ν.

Example: Coin tossing Let θ be the probability of getting a head with a biased coin. In n tosses of the coin we observe X = x heads, so

  p(x|θ) = (n choose x) θ^x (1 − θ)^{n−x},  x = 0, 1, 2, ..., n.

Now suppose we only know that θ is on the probability scale, and so we have a uniform prior, π(θ) = 1, 0 < θ < 1. Now the posterior is proportional to likelihood times prior,

  π(θ|x) ∝ p(x|θ)π(θ) = (n choose x) θ^x (1 − θ)^{n−x} · 1 ∝ θ^x (1 − θ)^{n−x}.

Notice that this has the form of a beta distribution, that is it depends on the variable, θ, in the correct way. Hence the posterior distribution is beta, and we can identify the parameters as p = x + 1 and q = n − x + 1, that is θ|x ∼ β(x + 1, n − x + 1). We can now write down the pdf

  π(θ|x) = θ^x (1 − θ)^{n−x} / B(x + 1, n − x + 1).

8.3 Point and interval estimation

In classical statistics we have been interested in estimating a parameter θ. This can also be done in Bayesian statistics. Recall that the posterior distribution contains all the information about θ – hence we base all our estimation on the posterior pdf. Natural estimators of θ are: the Posterior Mean, or Bayes Estimator, that is E[θ|X = x], and the Posterior Mode, or Maximum a Posteriori (MAP) Estimator. The MAP estimator is the most likely value of θ and is the analogue of the maximum likelihood estimator. To reflect the precision of this estimation we can construct a credibility interval (the equivalent of the classical confidence interval).
A 100(1 − α)% credibility interval for θ can be found using the probability statement

  Pr(θL ≤ θ ≤ θU) = 1 − α.

This can be interpreted as: the probability of θ being inside the interval is 1 − α (this is much more intuitive than the interpretation of the classical confidence interval). On its own this does not give a unique definition of the interval, and so we can introduce the extra condition that

  Pr(θ ≤ θL) = Pr(θ ≥ θU) = α/2;

this is called the equal-tailed interval.

Example: Coin tossing (continued) Since the posterior pdf is a beta distribution we already know equations for the two point estimators: the mean is (x + 1)/(n + 2) and the mode is x/n (which is the same as the MLE).

Exercises

(8.1) Suppose that we have a single observation x from a Poisson distribution with parameter θ. Derive the posterior distribution, π(θ|x), when the prior distribution of θ is a Gamma(a, b) distribution. Show that this is a conjugate prior. Write down the posterior mean θ̄ = E[θ|X = x] and the maximum a posteriori (MAP) estimator θ̂ = arg max π(θ|X = x). For observation x = 3, and prior parameters a = 2 and b = 0.7, what is the corresponding posterior distribution? Draw a graph of the prior and posterior distributions and comment. Find the posterior mean and the MAP estimate.

(8.2) The number of defective items, X, in a random sample of n has a binomial distribution where the probability that a single item is defective is θ (0 < θ < 1). If the prior distribution of θ is the beta distribution with parameters α and β, obtain the posterior distribution of θ given X = x. Determine the posterior mean E[θ|X = x]. In a particular case it is found that n = 25, x = 8 and the prior belief about θ can be summarised by a distribution with prior mean 1/2 and prior standard deviation 1/4.
Determine the posterior mean μ̂ = E[θ|X = x] and obtain the posterior standard deviation (which gives an indication of the precision of the posterior mean).

(8.3) Suppose that we have a single observation x from an exponential distribution with parameter θ. Consider a Gamma(a, b) distribution as a prior for θ. Derive the posterior distribution, π(θ|x), and show that this is a conjugate prior. Write down the posterior mean θ̄ = E[θ|X = x] and the maximum a posteriori (MAP) estimator θ̂ = arg max π(θ|X = x). With data x = 4.8, and prior parameters a = 10 and b = 1.5, what is the corresponding posterior distribution? Find the posterior mean and the MAP estimate. Also calculate the posterior standard deviation. Comment.

Practical Exercises to be Completed Using R

Tomorrow, you will meet the R statistical programming environment. Once you are familiar with R, you might like to try out the exercises below. The following simple exercises will allow you to check some of your early answers, but will also require the use of a range of R functions. As well as performing more complicated statistical analyses, R is very useful for performing calculations and for plotting graphs. Over the page is a more complicated example where, although the individual calculations are simple, it would be too time consuming to perform them by hand.

1. In Exercise (1.1), evaluate the expected value and variance of X. Hint: Define vectors for x and the probabilities, take the element-wise product, then sum.

2. In Exercise (1.2), plot a graph of the probability density function. Hint: Use the curve command.

3. In Exercise (3.1), evaluate the two probabilities that the student passes the exam. Hint: Use the pbinom command.

4. In Exercise (5.1), evaluate the fitted frequencies using the estimated value of p. Hint: Use the dbinom command.

5.
In Exercise (5.2), calculate the test statistic and the corresponding p-value. Hint: Use the pt command, or the t.test command.

6. In Exercise (6.1), calculate the three probability values for the standard normal. Hint: Use the pnorm command.

7. In Exercise (6.3), calculate the two probability values for the normal random variable with mean 10 and variance 25. Hint: Use the pnorm command, giving the mean and standard deviation.

8. In Exercise (6.4), calculate the exact binomial probability and compare it to the previously found approximation. Hint: Use the pbinom command.

9. In Exercise (7.1), simulate some data from the continuous uniform distribution, and draw a histogram. Does this look consistent with the exponential? Hint: Use the runif and hist commands.

10. In Exercise (7.2), simulate two equal-sized samples, each from a different normal distribution, and calculate the element-wise sum. Draw a histogram and evaluate the mean and variance. Are these consistent with the theoretical result? Hint: Use the rnorm, hist, mean and var commands.

Extended Practical Exercise

Suppose that 100 people are subject to a blood test. However, rather than testing each individual separately (which would require 100 tests), the people are divided into groups of 5 and the blood samples of the people in each group are combined and analysed together. If the combined test is negative, one test will suffice for the whole group; if the test is positive, each of the 5 people in the group will have to be tested separately, so that overall 6 tests will be made. Assume that the probability that an individual tests positive is 0.02 for all people, independently of each other. In general, let N be the total number of individuals, n be the number in each group (with k = N/n the number of groups), and p the probability that an individual tests positive.
Consider one group of size n, and let Ti be the number of tests required for the ith group (i = 1, . . . , k). The combined test is negative, and hence one test is sufficient, with probability

Pr(Ti = 1) = Pr(combined test is negative) = (1 − p)^n;

otherwise it is positive, and n + 1 tests are required, with probability

Pr(Ti = n + 1) = Pr(combined test is positive) = 1 − (1 − p)^n.

Now the expected number of tests for the ith group is

E[Ti] = 1 × Pr(Ti = 1) + (n + 1) × Pr(Ti = n + 1)
      = (1 − p)^n + (n + 1)(1 − (1 − p)^n)
      = (n + 1) − n(1 − p)^n.

The expected total number of tests, E[T], is then given by the sum of the expected numbers for each group:

E[T] = E[T1] + · · · + E[Tk] = k × ((n + 1) − n(1 − p)^n).

For the given values, N = 100, n = 5, p = 0.02, the expectations are

E[Ti] = 6 − 5(0.98)^5 ≈ 1.4804, and therefore E[T] = 20 × 1.4804 ≈ 29.6079 ≈ 30.

So on average only 30 tests are required, instead of 100. But, for the given total number of people, is this the best choice of n, and what happens as p varies?

Use R to repeat the above calculations, then try other values of n (which lead to integer k) to see if n = 5 gives the smallest expected total number of tests. Repeat this process for p = 0.01 and p = 0.5, and comment on the best choice of group size. If possible, produce a line graph of the expected total number of tests, E[T], against group size, n, with separate lines for different values of p. Also produce a graph of the optimal choice of n against p. Comment on these graphs.

Solutions to Practical Exercises

1. > x = c(-3,-1,0,1,2,3,5,8)
   > p = c(0.1,0.2,0.15,0.2,0.1,0.15,0.05,0.05)
   > xp = x*p
   > x2p = x**2*p
   > ex = sum(xp)
   > ex2 = sum(x2p)
   > ex2 - ex**2

2. > curve((3/4)*x*(2-x), 0, 2)

3. > 1 - pbinom(3, 20, 1/5)
   > 1 - pbinom(3, 20, 1/3)

4.
   > x = c(0,1,2,3,4,5)
   > f = c(8,40,88,110,56,18)
   > p = sum(x*f/320)/5
   > dbinom(0:5, 5, p)*320

5. > x = c(0.05,0.29,0.39,-0.18,0.11,0.15,0.35,0.28,-0.17,0.07)
   > xm = mean(x)
   > xsd = sd(x)
   > tobs = (xm-0)/(xsd/sqrt(10))
   > 2*(1-pt(2.11,9))
   > t.test(x)

6. > pnorm(2)
   > pnorm(1) - pnorm(-1)
   > pnorm(1.96) - pnorm(-1.96)

7. > pnorm(20,10,5) - pnorm(15,10,5)
   > pnorm(8,10,5)

8. > 1 - pbinom(29, 50, .52)

9. > x = runif(100)
   > y = -log(x)
   > hist(y)
   > mean(y)
   > x = runif(1000)
   > y = -log(x)/5
   > hist(y)
   > mean(y)

10. > x = rnorm(1000)
    > hist(x)
    > mean(x)
    > var(x)
    > y = rnorm(1000, 10, 5)
    > mean(y)
    > var(y)
    > z = x + y
    > hist(z)
    > mean(z)
    > var(z)
    > sd(z)

Extended Exercise

> N = 100
> n = 5
> p = 0.02
> k = N/n
> eT = k*((n+1) - n*(1-p)**n)
>
> ns = 1:20
> eT = rep(0, length(ns))
> p = 0.02
> for (i in 1:length(ns)){
+   k = N/ns[i]
+   eT[i] = k*((ns[i]+1) - ns[i]*(1-p)**ns[i])
+ }
> plot(ns, eT)

Solutions to Exercises

(1.1) Clearly all probabilities are between 0 and 1, and they sum to 1. Hence they define a valid probability distribution. [Note that only checking that the probabilities sum to 1 is not sufficient, as, for example, both (3/5, 3/5, −1/5) and (1/4, 5/4, −1/2) sum to 1 but violate other conditions.] For the first two parts, add the appropriate probabilities to give 0.45 and 0.3 respectively. To calculate the mean and variance we extend the probability table with two extra rows:

x            -3     -1     0      1      2      3      5      8
pX(x)        0.1    0.2    0.15   0.2    0.1    0.15   0.05   0.05
x pX(x)      -0.3   -0.2   0      0.2    0.2    0.45   0.25   0.4
x^2 pX(x)    0.90   0.20   0      0.20   0.40   1.35   1.25   3.20

Summing the last two rows, we obtain E[X] = Σ x pX(x) = 1 and E[X^2] = Σ x^2 pX(x) = 7.5, and so Var(X) = E[X^2] − (E[X])^2 = 7.5 − 1^2 = 6.5.

(1.2) First note that for fX(x) = cx(2 − x) to be a valid density we require the p.d.f. to be non-negative everywhere.
Clearly, for 0 ≤ x ≤ 2 we require c ≥ 0, and note that for x outside this range fX(x) = 0 by definition. Also, using the fact that ∫_{−∞}^{∞} fX(x) dx = 1:

∫_{−∞}^{∞} fX(x) dx = c ∫_0^2 (2x − x^2) dx = c [x^2 − x^3/3]_0^2 = 4c/3, hence c = 3/4.

Recall the definition of the c.d.f.: FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(y) dy. [To avoid possible confusion between the variable over which we are integrating and the upper limit of integration, it is usually safest to re-label one of them.] Note that if x < 0 then FX(x) = 0, and if x > 2 then FX(x) = 1. If 0 ≤ x ≤ 2 then

FX(x) = ∫_{−∞}^{x} fX(y) dy = ∫_0^x fX(y) dy = (3/4)(x^2 − x^3/3).

As a result we can write:

FX(x) = 0                       for x < 0,
FX(x) = (3/4)(x^2 − x^3/3)      for 0 ≤ x ≤ 2,
FX(x) = 1                       for x > 2.

[Make sure that you define the cumulative distribution function for all real values and include FX(x) = 0 and FX(x) = 1 in the answer.] Straight from the c.d.f. we have P(X > 1) = 1 − P(X ≤ 1) = 1 − FX(1) = 1 − (3/4)(1 − 1/3) = 1/2.

(2.1) First find the marginal of Y by summing over the values of x to give:

pY(y) = 4/32 for y = −2, 12/32 for y = −1, 12/32 for y = 0, 4/32 for y = 1.

Then the cumulative distribution function, using FY(y) = Pr(Y ≤ y), is:

FY(y) = 0      for y < −2,
        4/32   for −2 ≤ y < −1,
        16/32  for −1 ≤ y < 0,
        28/32  for 0 ≤ y < 1,
        1      for 1 ≤ y.

[Figures: plots of the p.m.f. f(y) and the step-function c.d.f. F(y).] [Notice that we must define the c.d.f. for all real numbers even though we are only really interested in the central part. Marks would be lost for missing these "extreme" values.]

For the conditional distribution, first find the marginal of X by summing over y to give:

x        1       2       3
pX(x)    8/32    16/32   8/32

Then use pY|X(y|x) = p(x, y)/pX(x) to give:

         y = -2   y = -1   y = 0   y = 1
x = 1    1/8      3/8      3/8     1/8
x = 2    1/8      3/8      3/8     1/8
x = 3    1/8      3/8      3/8     1/8

[Here the p.m.f.s are shown as tables – compare to (a) above – either approach is fine.
Also, each row of the conditional probabilities table is a probability distribution and so sums to 1.] Since the conditional distribution of Y given X = x does not depend on x (equivalently, the conditional distribution is equal to the marginal), X and Y are independent.

(2.2) First find the marginal of X by integrating over the variable we do not want, i.e. over y:

fX(x) = ∫_0^2 (6/7)(x^2 + xy/2) dy = (6/7)[x^2 y + x y^2/4]_0^2 = (6/7)(2x^2 + x), 0 ≤ x ≤ 1.

Similarly, fY(y) = (6/7)(1/3 + y/4), 0 ≤ y ≤ 2.

Now the c.d.f.s:

FX(x) = 0 for x < 0, (6/7)(2x^3/3 + x^2/2) for 0 ≤ x ≤ 1, and 1 for 1 < x;

FY(y) = (6/7)(y/3 + y^2/8) for 0 ≤ y ≤ 2, with FY(y) = 0 for y < 0 and 1 for y > 2.

[Figures: plots of the marginal densities and distribution functions of X and Y.]

The expectations:

E[X] = ∫_0^1 x (6/7)(2x^2 + x) dx = (6/7) ∫_0^1 (2x^3 + x^2) dx = (6/7)[2x^4/4 + x^3/3]_0^1 = (6/7)(1/2 + 1/3) = 5/7.

E[X(X − Y)] = ∫_0^1 ∫_0^2 (x^2 − xy)(6/7)(x^2 + xy/2) dy dx
            = (6/7) ∫_0^1 ∫_0^2 (x^4 − x^3 y/2 − x^2 y^2/2) dy dx
            = (6/7) ∫_0^1 [x^4 y − x^3 y^2/4 − x^2 y^3/6]_0^2 dx
            = (6/7) ∫_0^1 (2x^4 − x^3 − 4x^2/3) dx
            = (6/7)[2x^5/5 − x^4/4 − 4x^3/9]_0^1 = −53/210.

E[X|Y = 1] = ∫_0^1 x fX|Y(x|1) dx = ∫_0^1 x f(x, 1)/fY(1) dx,

so we must first evaluate the marginal density of Y:

fY(y) = ∫_0^1 (6/7)(x^2 + xy/2) dx = (6/7)[x^3/3 + x^2 y/4]_0^1 = (6/7)(1/3 + y/4),

and fY(1) = (6/7)(1/3 + 1/4) = 1/2, so

E[X|Y = 1] = ∫_0^1 x (6/7)(x^2 + x/2)/(1/2) dx = (12/7)[x^4/4 + x^3/6]_0^1 = (12/7)(1/4 + 1/6) = 5/7.

(2.3) Let D be the event that the tested person has the disease and B the event that his/her test result is positive. Then, according to the information given in the question, we have

Pr(B|D) = 0.8, Pr(B|D^c) = 0.05, Pr(D) = 0.004.

Using Bayes' formula, we obtain

Pr(D|B) = Pr(D ∩ B)/Pr(B) = Pr(B|D) Pr(D) / [Pr(B|D) Pr(D) + Pr(B|D^c) Pr(D^c)]
        = (0.8 × 0.004) / (0.8 × 0.004 + 0.05 × 0.996) ≈ 0.0604.
Remark. This probability may look surprisingly small. An explanation is as follows. Since 0.4% of the population actually have the disease, on average 40 persons out of every 10,000 will have it. The test will (on average) successfully reveal the disease in 40 × 0.8 = 32 cases. On the other hand, for the 9,960 healthy persons, the test will state that about 9,960 × 0.05 = 498 of them are 'ill'. Therefore, the test appears positive in about 32 + 498 = 530 cases, but the fraction of those who actually have the disease is approximately 32/530 ≈ 0.0604.

(3.1) The exam results can be modelled by Bernoulli trials with probability of success p = 1/5 in part (a) and p = 1/3 in part (b). If X is the number of correct answers, then X has the distribution Bin(n = 20, p) with probabilities

Pr(X = k) = C(20, k) p^k (1 − p)^{20−k}, k = 0, . . . , 20.

Noting that 20% of 20 is 4, the probability of passing the exam is given by

Pr(X ≥ 4) = 1 − Pr(X < 4) = 1 − Σ_{k=0}^{3} Pr(X = k).

The results are shown in the table:

p      Pr(X = 0)   Pr(X = 1)   Pr(X = 2)   Pr(X = 3)   Pr(X ≥ 4)
1/5    0.0115      0.0576      0.1369      0.2054      0.5886
1/3    0.0003      0.0030      0.0143      0.0429      0.9396

giving the answers: (a) 0.5886, (b) 0.9396.

(3.2) (a) For the exponential distribution the c.d.f. is

Pr(X ≤ x) = FX(x) = 1 − e^{−λx}, x ≥ 0,

and so Pr(X > x) = 1 − Pr(X ≤ x) = 1 − (1 − e^{−λx}) = e^{−λx}. Hence, with λ = 2,

Pr(X > 1/2) = e^{−2×1/2} = e^{−1} = 0.3679 (4 s.f.).

(b) We require x such that FX(x) = 1/2 (note that this is the median), that is, x such that 1 − e^{−2x} = 1/2, hence 1/2 = e^{−2x} and −log 2 = −2x, so x = (1/2) log 2 = 0.3466 (4 s.f.).

(c) From the definition of conditional probability:

Pr(X > 1 | X > 1/2) = Pr((X > 1) ∩ (X > 1/2)) / Pr(X > 1/2),

but note that (X > 1) ⊂ (X > 1/2) and so (X > 1) ∩ (X > 1/2) = (X > 1). Hence we require

Pr(X > 1)/Pr(X > 1/2) = e^{−2×1}/e^{−2×1/2} = e^{−2}/e^{−1} = e^{−1} = 0.3679 (4 s.f.).
(3.3) The moment generating function of the Poisson distribution is found as follows:

MX(t) = E[e^{tX}] = Σ_x e^{tx} θ^x e^{−λ}... rather, with parameter λ,

MX(t) = E[e^{tX}] = Σ_x e^{tx} λ^x e^{−λ}/x! = e^{−λ} Σ_x (λe^t)^x/x!
      = e^{λ(e^t − 1)} Σ_x (λe^t)^x e^{−λe^t}/x! = e^{λ(e^t − 1)},

where the final sum is over the Po(λe^t) p.m.f. (check this) and so is equal to 1. [Note that, as with the derivation of the binomial m.g.f., here we could directly use the series expansion of the exponential.]

Differentiating and setting t = 0 gives:

dMX(t)/dt = λe^t e^{λ(e^t − 1)},
d^2 MX(t)/dt^2 = λe^t e^{λ(e^t − 1)} + (λe^t)^2 e^{λ(e^t − 1)},

and so E[X] = λ and E[X^2] = λ + λ^2. Hence Var(X) = λ.

The moment generating function of the Poisson random variable Xi is MXi(t) = e^{λi(e^t − 1)}, and so, if Sk = X1 + · · · + Xk, then the moment generating function of Sk is

MSk(t) = Π_{i=1}^{k} MXi(t) = Π_{i=1}^{k} e^{λi(e^t − 1)} = e^{(Σ λi)(e^t − 1)},

which is the moment generating function of a Poisson random variable with parameter Σ λi; that is, Sk ∼ Po(Σ λi). Hence the mean and variance of Sk are both equal to Σ λi.

[In the last two questions, we see the power of the moment generating function: we are producing important results without too much difficulty.]

(4.1) The coding centres the x-values, meaning that Σ xi = 0, and hence x̄ = 0. The remaining calculations are as follows. [Figure: scatterplot of sugar remaining against coded temperature, with fitted line.]
Coded temp., xi   Sugar, yi   xi^2    xi yi
-0.5              8.1         0.25    -4.05
-0.4              7.8         0.16    -3.12
-0.3              8.5         0.09    -2.55
-0.2              9.8         0.04    -1.96
-0.1              9.5         0.01    -0.95
0.0               8.9         0.00    0.00
0.1               8.6         0.01    0.86
0.2               10.2        0.04    2.04
0.3               9.3         0.09    2.79
0.4               9.2         0.16    3.68
0.5               10.5        0.25    5.25
Totals:           100.4       1.10    1.99

Since ȳ = 100.4/11 = 9.13 and Sxy/Sxx = Σ xi yi / Σ xi^2 = 1.99/1.10 = 1.81 (to 2 d.p.), we get the regression equation

sugar = 9.13 + 1.81 × coded temp,

where "sugar" is the sugar remaining after fermentation and "coded temp" is the fermentation temperature minus 21.5 degrees centigrade. Alternatively, we could give the regression equation as

sugar = −29.77 + 1.81 × temp,

where "temp" is in degrees centigrade. Either form is correct, but you need to be clear as to which form you have used.

(4.2) Re-arranging the model gives us εi = yi − α − β1 xi − β2 wi (i = 1, . . . , n). Hence the sum of squared errors is

S = Σ_i εi^2 = Σ_i (yi − α − β1 xi − β2 wi)^2.

Differentiating S with respect to α, we get

∂S/∂α = −2 Σ_i (yi − α − β1 xi − β2 wi) = −2(nȳ − nα),

since Σ_i xi = Σ_i wi = 0 as these variables are already centred. Setting this derivative to zero when α = α̂, we get α̂ = ȳ (as in the one-predictor case).

To find β̂1, we set ∂S/∂β1 to zero when β1 = β̂1:

0 = Σ_i xi (yi − α̂ − β̂1 xi − β2 wi) = Sxy − α̂ Σ_i xi − β̂1 Sxx − β2 Swx

⇒ Sxy = β̂1 Sxx + β2 Swx
⇒ β̂1 = (Sxy − β2 Swx)/Sxx.

This leaves us with one equation in two unknowns. To get round this, we substitute β̂2 for β2. Hence we need to find β̂2 by differentiating S with respect to β2, to get β̂2 = (Swy − β̂1 Swx)/Sww. With this substitution, we get

Sxx β̂1 = Sxy − (Swx/Sww)(Swy − β̂1 Swx)
        = Sxy − Swx Swy/Sww + β̂1 Swx^2/Sww

⇒ β̂1 = (Sxx − Swx^2/Sww)^{-1} (Sxy − Swx Swy/Sww)
      = (1 − Swx^2/(Sww Sxx))^{-1} (Sxy/Sxx − Swx Swy/(Sww Sxx)),

as required.
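The closed-form expression for β̂1 can be checked numerically against R's built-in least-squares fit. This is an illustrative sketch written for these notes; the simulated data and the coefficient values 2, 1.5 and 0.5 are arbitrary choices:

```r
# Check the two-predictor least-squares formula of (4.2) against lm()
set.seed(1)
n <- 50
x <- rnorm(n); x <- x - mean(x)   # centre, so sum(x) = 0
w <- rnorm(n); w <- w - mean(w)   # centre, so sum(w) = 0
y <- 2 + 1.5 * x + 0.5 * w + rnorm(n, sd = 0.1)
Sxx <- sum(x^2); Sww <- sum(w^2); Swx <- sum(w * x)
Sxy <- sum(x * y); Swy <- sum(w * y)
b1 <- (Sxy / Sxx - Swx * Swy / (Sww * Sxx)) / (1 - Swx^2 / (Sww * Sxx))
c(formula = b1, lm = unname(coef(lm(y ~ x + w))["x"]))  # the two should agree
```

Because the predictors are centred, the intercept from lm() is simply ȳ, matching α̂ above.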
(5.1) Each child is male or female with some fixed (but unknown) probability, and we are considering families of five children, so a suitable model is binomial, X ∼ B(m = 5, p). We also need to assume independence of children within a family. To estimate the probability, p, use x̄/m = 0.5375. The corresponding fitted frequencies are: 6.8, 39.4, 91.5, 106.3, 61.8, 14.4, which are pretty close to the observed frequencies.

(5.2) Let X be the daily log exchange rate; we are told that X ∼ N(µ, σ^2) is an acceptable model. To estimate the unknown parameters we use µ̂ = x̄ = 0.134 and σ̂ = s_{n−1} = 0.2002. To test the given hypothesis we use the t-test (as the population variance is unknown) with test statistic tobs = (x̄ − µ0)/(s/√n) = (0.134 − 0)/(0.2002/√10) = 2.1. For a 5% test, the critical value is tcrit such that Pr(T_{n−1} > tcrit) = 0.025, where T_{n−1} follows a t-distribution with n − 1 = 9 degrees of freedom. From the tables, Pr(T9 > 2.262) = 0.025. In our case tobs is not greater than tcrit, and hence there is not sufficient evidence to reject the null hypothesis. (From R, the p-value is 0.06342, hence the same conclusion.)

(6.1) From the statistical tables: (a) 0.9772; (b) Pr(−1 < Z < 1) = Pr(Z < 1) − Pr(Z < −1) = Pr(Z < 1) − (1 − Pr(Z < 1)) = 2 × Pr(Z < 1) − 1 = 2 × 0.8413 − 1 = 0.6826; and (c) Pr(Z^2 < 3.8416) ≡ Pr(−1.96 < Z < 1.96) = 2 × Pr(Z < 1.96) − 1 ≈ 2 × Pr(Z < 1.95) − 1 = 2 × 0.9744 − 1 = 0.9488. (Note that retaining 1.96 the answer is 0.95.)

(6.2) With the given MGF, MX(t) = exp{µt + (1/2)σ^2 t^2}, and using Result 4 in Section 3.2 with a = 1/σ and b = −µ/σ, the MGF of X* is

MX*(t) = exp{−µt/σ} MX(t/σ) = exp{−µt/σ} × exp{µt/σ + (1/2)σ^2 (t/σ)^2} = exp{(1/2)t^2}.

This is of the same form as the original MGF but with mean zero and unit variance; hence X* ∼ N(0, 1) by the uniqueness of the MGF.
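The t-test of (5.2) is easy to reproduce numerically; the data are those listed in the practical-exercise solutions above:

```r
# One-sample t-test for the log exchange-rate data of (5.2)
x <- c(0.05, 0.29, 0.39, -0.18, 0.11, 0.15, 0.35, 0.28, -0.17, 0.07)
tobs <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))
pval <- 2 * (1 - pt(tobs, df = length(x) - 1))
c(t = tobs, p = pval)  # cf. the quoted p-value of 0.06342
```

The same numbers come directly from t.test(x), which also reports a 95% confidence interval for µ.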
(6.3) Let X ∼ N(µ = 10, σ^2 = 25) and Z ∼ N(0, 1). To evaluate the probabilities stated in (a) and (b), we must first standardise X so we can refer to the standard normal table. From the lecture notes, we know that X = σZ + µ. So:

Pr(X ≤ x) = Pr(σZ + µ ≤ x) = Pr(Z ≤ (x − µ)/σ) = Φ((x − µ)/σ),

where Φ(z) = FZ(z) = Pr(Z ≤ z).

(a) To evaluate Pr(X ≤ 8), we first standardise:

Pr(X ≤ 8) = Pr(Z ≤ (8 − µ)/σ) = Φ((8 − 10)/5) = Φ(−0.4).

As Φ(−z) = 1 − Φ(z), Φ(−0.4) = 1 − Φ(0.4) = 1 − 0.6554 = 0.3446.

(b) We can rewrite Pr(15 ≤ X ≤ 20) in terms of the following cumulative probabilities:

Pr(15 ≤ X ≤ 20) = Pr(X ≤ 20) − Pr(X ≤ 15).

Note: for continuous random variables, Pr(X ≤ x) = Pr(X < x). The next step is to standardise:

Pr(X ≤ 20) = Φ((20 − 10)/5) = Φ(2), and Pr(X ≤ 15) = Φ((15 − 10)/5) = Φ(1).

From the normal tables, we find that Φ(2) = 0.9772 and Φ(1) = 0.8413. Hence:

Pr(15 ≤ X ≤ 20) = 0.9772 − 0.8413 = 0.1359.

(6.4) We are told that X ∼ Bin(n = 50, p = 0.52), and the normal approximation of this distribution is N(µ = np, σ^2 = np(1 − p)), so X is approximated by Y, where Y ∼ N(µ = 26, σ^2 = 12.48).

As we are approximating a discrete distribution with a continuous one, we must apply the continuity correction, so that Pr(X ≥ 30) = Pr(Y > 29.5). As in the previous question, we must standardise so we can use the normal tables. Again let Z ∼ N(0, 1), recall the symmetry property Φ(−z) = 1 − Φ(z), and note that Pr(Y ≤ y) = Pr(Y < y) for continuous distributions. Then

Pr(Y ≤ 29.5) = Pr(Z ≤ (29.5 − µ)/σ) = Φ(3.5/√12.48) = Φ(0.9907 . . .).

Using interpolation (as described beside the normal table):

Φ(0.9907 . . .) ≈ 0.8289 + ((0.9907 . . . − 0.95)/(1.00 − 0.95)) × (0.8413 − 0.8289) = 0.8390.

Therefore Pr(Y > 29.5) = 1 − 0.8390 = 0.161.
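As a check on (6.4), R gives both the exact binomial tail and the continuity-corrected normal approximation directly:

```r
# Exact binomial tail probability versus the normal approximation in (6.4)
exact  <- 1 - pbinom(29, size = 50, prob = 0.52)        # Pr(X >= 30)
approx <- 1 - pnorm(29.5, mean = 26, sd = sqrt(12.48))  # Pr(Y > 29.5)
c(exact = exact, approx = approx)  # the two should be close
```

The small discrepancy from the 0.161 above is due to the table interpolation; pnorm evaluates Φ exactly.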
(7.1) Here X ∼ U(0, 1), with y = −log(x)/λ (a monotonic transformation), so x = e^{−λy} and |dx/dy| = |−λe^{−λy}| = λe^{−λy}; hence

fY(y) = λe^{−λy}, y ≥ 0,

therefore Y ∼ Exp(λ); that is, Y has an exponential distribution with parameter λ.

(7.2) Start with the moment generating function of the normal random variable Xi, MXi(t) = exp{µi t + (1/2)σi^2 t^2}. The moment generating function of Sn = X1 + · · · + Xn is then

MSn(t) = Π_{i=1}^{n} MXi(t) = Π_{i=1}^{n} exp{µi t + (1/2)σi^2 t^2} = exp{(Σ µi)t + (1/2)(Σ σi^2)t^2}.

This is the moment generating function of a normal random variable with mean Σ µi and variance Σ σi^2; hence Sn ∼ N(Σ µi, Σ σi^2).

If the random variables have equal mean and variance, this result becomes Sn ∼ N(nµ, nσ^2). Then, again using Result 4 in Section 3.2, with a = 1/n and b = 0, we have

MX̄(t) = exp{0 × t} × MSn(t/n) = exp{nµ(t/n) + (1/2)nσ^2 (t/n)^2} = exp{µt + (1/2)(σ^2/n)t^2},

which is the MGF of a normal random variable with mean µ and variance σ^2/n; hence X̄ ∼ N(µ, σ^2/n).

(8.1) With a single observation x from a Poisson distribution, the likelihood is l(θ) = f(x|θ) = θ^x e^{−θ}/x!, and the prior distribution of θ is Gamma(a, b):

π(θ) = b^a θ^{a−1} e^{−bθ}/Γ(a), θ > 0.

Therefore, the posterior distribution of θ|x is

π(θ|x) = f(x|θ)π(θ)/f(x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ.

Substituting gives

π(θ|x) = [θ^x e^{−θ}/x! × b^a θ^{a−1} e^{−bθ}/Γ(a)] / ∫ [θ^x e^{−θ}/x! × b^a θ^{a−1} e^{−bθ}/Γ(a)] dθ
       = θ^{x+a−1} e^{−(b+1)θ} / ∫ θ^{x+a−1} e^{−(b+1)θ} dθ.

Note that the denominator (and numerator) are almost a Gamma distribution – only the normalising constants are missing – with parameters a + x and b + 1. Adding the appropriate constants gives

π(θ|x) = [(b + 1)^{x+a} θ^{x+a−1} e^{−(b+1)θ}/Γ(x + a)] / ∫ [(b + 1)^{x+a} θ^{x+a−1} e^{−(b+1)θ}/Γ(x + a)] dθ.

The integral in the denominator is that of a Gamma p.d.f. over its full range and so has value 1. Hence

π(θ|x) = (b + 1)^{x+a} θ^{x+a−1} e^{−(b+1)θ}/Γ(x + a),

that is, a Gamma(a + x, b + 1) distribution.
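The conjugacy result can also be verified numerically: the likelihood times the prior should be proportional to the claimed Gamma(a + x, b + 1) density. A sketch written for these notes, using values consistent with the worked numbers that follow (prior Gamma(2, 0.7), posterior Gamma(5, 1.7), so x = 3):

```r
# Numerical check that (Poisson likelihood) x (Gamma(a, b) prior)
# is proportional to a Gamma(a + x, b + 1) density
a <- 2; b <- 0.7; x <- 3
theta  <- seq(0.1, 10, by = 0.1)
unnorm <- dpois(x, lambda = theta) * dgamma(theta, shape = a, rate = b)
ratio  <- unnorm / dgamma(theta, shape = a + x, rate = b + 1)
diff(range(ratio)) / mean(ratio)  # essentially zero: the ratio is constant
```

A constant ratio over the whole grid means the two functions differ only by a normalising constant, which is exactly the conjugacy claim.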
As the prior and posterior are both Gamma distributions, we have a conjugate prior here.

For a Gamma(α, λ) posterior, the posterior mean is α/λ and the MAP (the posterior mode) is (α − 1)/λ. Substituting α = a + x and λ = b + 1 gives posterior mean (a + x)/(b + 1) and MAP (a + x − 1)/(b + 1).

With the values given, the prior is Gamma(2, 0.7) and the posterior is Gamma(5, 1.7). These give estimates 2.9 and 2.4 respectively. [Figure: prior (solid line) and posterior (dashed line) densities.] The graph shows that the posterior density is more concentrated than the prior, and that the mean and mode have increased, compared to the prior values, due to the higher data value.

(8.2) For this example the prior is Beta, with p.d.f. π(θ) = θ^{α−1}(1 − θ)^{β−1}/B(α, β), 0 < θ < 1, and the data have a binomial distribution: X|θ ∼ Binomial(n, θ). Notice that this is almost the same as one of the class examples, and hence here we take the approach of looking only at the functional form of the posterior, that is, ignoring constants. So the posterior is:

π(θ|x) ∝ f(x|θ)π(θ) ∝ C(n, x) θ^x (1 − θ)^{n−x} θ^{α−1}(1 − θ)^{β−1} ∝ θ^{x+α−1}(1 − θ)^{n−x+β−1}.

Thus θ|x ∼ Beta(x + α, n − x + β). The mean of a Beta(α, β) distribution is α/(α + β), and therefore the posterior mean is

E[θ|x] = (x + α)/(n + α + β).

Given n = 25 and x = 8, and that the prior has mean 1/2 and standard deviation 1/4: for Y ∼ Beta(α, β),

E[Y] = α/(α + β) and Var[Y] = αβ/{(α + β)^2(α + β + 1)}.

Therefore

α/(α + β) = 1/2 and αβ/{(α + β)^2(α + β + 1)} = 1/16.

From the first of these, α = β. Substituting this into the second gives 16α^2 = 4α^2(2α + 1), thus 4 = 2α + 1 (α ≠ 0) and α = 3/2 = β.

For the above values, the posterior mean is

µ̂ = E[θ|x] = (x + α)/(n + α + β) = (8 + 3/2)/(25 + 3) = 19/56 = 0.3393.

An estimate of the precision of µ̂ is obtained by calculating the standard deviation of θ|x.
If the standard deviation is small, then µ̂ is a precise estimate, whereas if the standard deviation is large, then µ̂ is not. Here, with θ|x ∼ Beta(8 + 3/2, 17 + 3/2),

Var[θ|x] = (8 + 3/2)(17 + 3/2)/{28^2 × 29} = 0.007730.

Thus the standard deviation is 0.0879.

(8.3) Here the prior is Gamma, θ ∼ Gamma(a, b), with p.d.f.

π(θ) = b^a θ^{a−1} e^{−bθ}/Γ(a), θ, a, b > 0,

and the data have an exponential distribution, X|θ ∼ Exp(θ), with p.d.f.

f(x|θ) = θ e^{−θx}, x ≥ 0, θ > 0.

Again we take the approach of looking only at the functional form of the posterior, that is, ignoring constants. So the posterior is:

π(θ|x) ∝ f(x|θ)π(θ) ∝ θe^{−θx} × b^a θ^{a−1} e^{−bθ}/Γ(a) ∝ θ^{(a+1)−1} e^{−(b+x)θ},

and so, with the normalising constant,

π(θ|x) = (b + x)^{a+1} θ^{(a+1)−1} e^{−(b+x)θ}/Γ(a + 1).

Thus θ|x ∼ Gamma(a + 1, b + x). Recall that for a Gamma(α, β), the mean is α/β, the mode is (α − 1)/β and the variance is α/β^2. Therefore the posterior mean is

θ̄ = (a + 1)/(b + x).

With the given numbers this is (10 + 1)/(1.5 + 4.8) = 1.746, and the MAP estimate is

θ̂ = a/(b + x) = 10/(1.5 + 4.8) = 1.587.

The posterior standard deviation is √(a + 1)/(b + x) = √11/(1.5 + 4.8) = 0.526, which is large compared to the estimates, so the estimates are not precise.

Standard Distributions

1. A Bernoulli random variable, X, with parameter θ has probability mass function

p(x; θ) = θ^x (1 − θ)^{1−x}, x = 0, 1 (0 < θ < 1),

and mean and variance E[X] = θ and Var[X] = θ(1 − θ).

2. A geometric random variable, X, with parameter θ has probability mass function

p(x; θ) = θ(1 − θ)^{x−1}, x = 1, 2, . . . (0 < θ < 1),

and mean and variance E[X] = 1/θ and Var[X] = (1 − θ)/θ^2.

3. A negative binomial random variable, X, with parameters r and θ has probability mass function

p(x; r, θ) = C(x − 1, r − 1) θ^r (1 − θ)^{x−r}, x = r, r + 1, . . . (r > 0 and 0 < θ < 1),

and mean and variance E[X] = r/θ and Var[X] = r(1 − θ)/θ^2.

4.
A binomial random variable, X, with parameters n and θ (where n is a known positive integer) has probability mass function

p(x; n, θ) = C(n, x) θ^x (1 − θ)^{n−x}, x = 0, 1, . . . , n (0 < θ < 1),

and mean and variance E[X] = nθ and Var[X] = nθ(1 − θ).

5. A Poisson random variable, X, with parameter θ has probability mass function

p(x; θ) = θ^x e^{−θ}/x!, x = 0, 1, . . . (θ > 0),

and mean and variance E[X] = θ and Var[X] = θ.

6. A uniform random variable, X, with parameter θ has probability density function

f(x; θ) = 1/θ, 0 < x < θ (θ > 0),

and mean and variance E[X] = θ/2 and Var[X] = θ^2/12.

7. An exponential random variable, X, with parameter λ has probability density function

f(x; λ) = λe^{−λx}, x > 0 (λ > 0),

and mean and variance E[X] = 1/λ and Var[X] = 1/λ^2.

8. A normal random variable, X, with parameters µ and σ^2 has probability density function

f(x; µ, σ^2) = (1/√(2πσ^2)) exp{−(x − µ)^2/(2σ^2)}, −∞ < x < ∞ (−∞ < µ < ∞, σ^2 > 0),

and mean and variance E[X] = µ and Var[X] = σ^2.

9. A gamma random variable, X, with parameters α and β has probability density function

f(x; α, β) = β^α x^{α−1} e^{−βx}/Γ(α), x > 0 (α, β > 0),

where Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx, and mean and variance E[X] = α/β and Var[X] = α/β^2. Note that Γ(α + 1) = αΓ(α) for all α > 0, and Γ(α + 1) = α! for integers α ≥ 1. Also Γ(1/2) = √π.

10. A beta random variable, X, with parameters α and β has probability density function

f(x; α, β) = x^{α−1}(1 − x)^{β−1}/B(α, β), 0 < x < 1 (α, β > 0),

where B(α, β) = ∫_0^1 x^{α−1}(1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β), and mean and variance E[X] = α/(α + β) and Var[X] = αβ/{(α + β)^2(α + β + 1)}.

11. A Pareto random variable, X, with parameters θ and α has probability density function

f(x; θ, α) = αθ^α/x^{α+1}, x > θ (θ, α > 0),

and mean and variance E[X] = αθ/(α − 1) (α > 1) and Var[X] = αθ^2/{(α − 1)^2(α − 2)} (α > 2).

12.
A chi-square random variable, X, with degrees of freedom parameter n (n a positive integer) has probability density function

f(x; n) = (1/2)^{n/2} x^{n/2−1} e^{−x/2}/Γ(n/2), x > 0,

and mean and variance E[X] = n and Var[X] = 2n.

13. A Student's t random variable, X, with degrees of freedom parameter n (n a positive integer) has probability density function

f(x; n) = [Γ((n + 1)/2)/(√(nπ) Γ(n/2))] (1 + x^2/n)^{−(n+1)/2}, −∞ < x < ∞,

and mean and variance E[X] = 0 (n > 1) and Var[X] = n/(n − 2) (n > 2).

14. An F random variable, X, with degrees of freedom parameters m and n (m, n positive integers) has probability density function

f(x; m, n) = [Γ((m + n)/2)/(Γ(m/2)Γ(n/2))] (m/n)^{m/2} x^{m/2−1} (1 + mx/n)^{−(m+n)/2}, x > 0,

and mean and variance E[X] = n/(n − 2) (n > 2) and Var[X] = 2n^2(m + n − 2)/{m(n − 2)^2(n − 4)} (n > 4).

Normal Distribution Function Tables

The first table gives

Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−y^2/2} dy,

which corresponds to the shaded area in the figure to the right. Φ(x) is the probability that a random variable, normally distributed with zero mean and unit variance, will be less than or equal to x. When x < 0 use Φ(x) = 1 − Φ(−x), as the normal distribution with mean zero is symmetric about zero.
For interpolation use the formula

Φ(x) ≈ Φ(x1) + ((x − x1)/(x2 − x1)) × (Φ(x2) − Φ(x1)), x1 < x < x2.

Table 1

x     Φ(x)     x     Φ(x)     x     Φ(x)     x     Φ(x)     x     Φ(x)     x     Φ(x)
0.00  0.5000   0.50  0.6915   1.00  0.8413   1.50  0.9332   2.00  0.9772   2.50  0.9938
0.05  0.5199   0.55  0.7088   1.05  0.8531   1.55  0.9394   2.05  0.9798   2.55  0.9946
0.10  0.5398   0.60  0.7257   1.10  0.8643   1.60  0.9452   2.10  0.9821   2.60  0.9953
0.15  0.5596   0.65  0.7422   1.15  0.8749   1.65  0.9505   2.15  0.9842   2.65  0.9960
0.20  0.5793   0.70  0.7580   1.20  0.8849   1.70  0.9554   2.20  0.9861   2.70  0.9965
0.25  0.5987   0.75  0.7734   1.25  0.8944   1.75  0.9599   2.25  0.9878   2.75  0.9970
0.30  0.6179   0.80  0.7881   1.30  0.9032   1.80  0.9641   2.30  0.9893   2.80  0.9974
0.35  0.6368   0.85  0.8023   1.35  0.9115   1.85  0.9678   2.35  0.9906   2.85  0.9978
0.40  0.6554   0.90  0.8159   1.40  0.9192   1.90  0.9713   2.40  0.9918   2.90  0.9981
0.45  0.6736   0.95  0.8289   1.45  0.9265   1.95  0.9744   2.45  0.9929   2.95  0.9984
0.50  0.6915   1.00  0.8413   1.50  0.9332   2.00  0.9772   2.50  0.9938   3.00  0.9987

The inverse function Φ^{-1}(p) is tabulated below for various values of p.

Table 2

p           0.900    0.950    0.975    0.990    0.995    0.999    0.9995
Φ^{-1}(p)   1.2816   1.6449   1.9600   2.3263   2.5758   3.0902   3.2905
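Finally, note that all of the table values above can be reproduced in R with pnorm and qnorm, which is a convenient way of checking table look-ups:

```r
# Reproducing the normal tables in R
pnorm(0.45)   # 0.6736, matching Table 1
qnorm(0.975)  # 1.9600 to 4 d.p., matching Table 2
pnorm(-0.4)   # equals 1 - pnorm(0.4), the symmetry rule above
```

No interpolation is needed in R, since pnorm evaluates Φ at any argument directly.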