Statistics 1: MATH11400
Oliver Johnson: [email protected] Twitter: @BristOliver
School of Mathematics, University of Bristol
Teaching Block 2, 2017

Course outline
10 chapters, 20 lectures, 10 weekly problem sheets.
Notes have gaps; model solutions will only be handed out on paper: IT IS YOUR RESPONSIBILITY TO ATTEND LECTURES AND TO ENSURE YOU HAVE A FULL SET OF NOTES AND SOLUTIONS.
Course webpage for notes, problem sheets, links etc: https://people.maths.bris.ac.uk/~maotj/teaching.html
Datasets for the course: https://people.maths.bris.ac.uk/~maotj/teach/stats1.RData
Drop-in sessions: Tuesdays 1-2. Just turn up to Room 3.17 at these times. (At other times I may be out or busy, but just email [email protected] to fix an appointment.)
This material is copyright of the University unless explicitly stated otherwise. It is provided exclusively for educational purposes at the University and is to be downloaded or copied for your private study only.

Contents
1 Introduction
2 Section 1: Basics of data analysis
3 Section 2: Parametric families and method of moments
4 Section 3: Likelihood and Maximum Likelihood Estimation
5 Section 4: Assessing the Performance of Estimators
6 Section 5: Sampling distributions related to the Normal distribution
7 Section 6: Confidence intervals
8 Section 7: Hypothesis Tests
9 Section 8: Comparison of population means
10 Section 9: Linear regression
11 Section 10: Linear Regression: Confidence Intervals & Hypothesis Tests

Textbooks
The recommended textbook for the unit is Mathematical Statistics and Data Analysis by J. A. Rice. This covers both the Probability and the Statistics material, and some of the second-year Statistics unit.
It combines modern ideas of data analysis, using graphical and computational techniques, with more traditional approaches to mathematical statistics. The statistical package R will be used to illustrate ideas in lectures and you will need to use it for set work. The notes and handouts should provide sufficient information, but a good introductory text for further reading is Introductory Statistics with R by Peter Dalgaard. It will be particularly useful for students who intend to continue studying statistics in their second, third (and fourth) year. Other books are listed on the course web page.

Section 1: Basics of data analysis
Aims of this section: This section introduces a selection of simple graphical and numerical methods for exploring and summarizing single data sets. These methods generally form part of an approach called Exploratory Data Analysis. Such analysis and evaluation can be informative in its own right. It also forms an essential first step before any detailed statistical analysis is performed on the data. The section also introduces the statistical package R through its use for simple graphical and numerical computation of plots and summary statistics.
Suggested reading: Rice Sections 10.1-10.6

Objectives: by the end of this section you should be able to
Construct simple graphical plots of data sets (stem-and-leaf plot, histogram, boxplot and (if appropriate) time-plot).
Use simple graphical plots to comment on the overall pattern of data in a data set, and identify and comment on any striking deviations from this pattern.
Calculate simple measures of location for a data set (median, mean and trimmed mean).
Calculate simple measures of spread for a data set (variance, standard deviation, hinges, quartiles and inter-quartile range).
Use the statistical package R to produce simple graphical plots and compute simple measures of location and spread for a given set of real-valued data.
Compute the order statistics for a given set of real-valued data.

Section 1.1: Our framework
Definition 1.1. Many statistical problems can be described by a simple framework of:
a population of objects
a real-valued variable X associated with each member of the population
some quantity of interest determined by the overall distribution of X values in the population
a sample of n members of the population
a data set {x1, . . . , xn} of observed values of X for the sampled members.
The key problem is to infer the unknown value of the population quantity from the known sample data.

Motivating Example
Example 1.2.
the population is 'all students graduating from Bristol in 2016',
the variable is 'debt of each student at the time of graduation',
the quantity of interest is the average debt in the population,
the sample is '100 students chosen randomly from the email database',
the data set {x1, . . . , x100} is their individual level of debt.
Example 1.3.
the population is all lightbulbs made by Firm X,
the variable is the lifetime of each lightbulb,
the quantity of interest is the proportion of lifetimes exceeding 2 years,
the sample is '1000 lightbulbs fitted in 2013, checked in 2015',
the data set x1, . . . , x1000 is 'the date that they failed' (some might not have failed yet).

More general settings are considered in later chapters. We may have sample data from several populations and want to determine if there is any pattern in the quantity of interest between populations.
e.g. given data on debt for a sample of students from several universities, we may want to explore how the average debt varies from university to university.
e.g.
given data on lifetimes of a sample of lightbulbs from several manufacturers, we may want to explore how the proportion of lifetimes exceeding 2 years varies from manufacturer to manufacturer.

Simple random samples
Definition 1.4. We say that sample data values x1, . . . , xn are the observed values of a simple random sample of size n from the population if each sample member is chosen independently of the other sample members, and each population member is equally likely to be included in the sample. Note that this can be hard to achieve in practice.
Remark 1.5. For simple random samples, data values are representative of values in the population as a whole. On average, different values occur in the sample in the same proportion as they occur in the population. Thus we can use data values from a (possibly small) sample to make inferences about values in the population as a whole.

Section 1.2: Exploratory Data Analysis (EDA)
Exploratory Data Analysis refers to a collection of techniques for exploration of a data set. EDA can help check that data is compatible with assumptions that:
1 observations are independent,
2 observations all come from a common distribution,
3 the distribution is a particular type (e.g. normal, exponential).
EDA can give simple direct estimates of some population quantities, without assuming any particular type of distribution. Features of EDA which we will use include:
Numerical summaries of centre or location of data (Section 1.4): median, mean, trimmed mean.
Numerical summaries of spread of data (Section 1.5): variance, standard deviation, hinges, quartiles and inter-quartile range.
Initial graphical plots (Section 1.6): stem-and-leaf plots, histograms (or bar charts), time plots, boxplots.

Section 1.3: Random samples and order statistics
Definition 1.6. Write {x1 , . . .
, xn} for a data set where the n values are arranged in time order – for example, the order of observation or recording. The first value seen is x1, . . . , the last seen is xn.
Definition 1.7. Write {x(1), . . . , x(n)} for the order statistics of the sample – that is, the data set rearranged so the values are increasing in size. The smallest (minimum) value seen is x(1). The largest (maximum) is x(n). We say that x(i) has rank i.
We can find the x(i) if we know all the xi. We cannot find the xi if we know all the x(i). If the xi are observations of independent identically distributed (IID) random variables, the x(i) are neither independent nor identically distributed.

Section 1.4: Measures of centre or location of data
Definition 1.8. The median is the 'middle observation' in the ranked order x(1) ≤ x(2) ≤ . . . ≤ x(n). If n = 2m + 1 is odd, the median is x(m+1). If n = 2m is even, the median is the average (x(m) + x(m+1))/2.
Pro: not sensitive to extreme values. Con: not easy to calculate when combining two samples.
Definition 1.9. The sample mean is defined as x̄ = (x1 + . . . + xn)/n. Equivalently, x̄ = (x(1) + . . . + x(n))/n.
Pro: easy to calculate, easy to combine two samples, easy to derive statistical properties. Con: sensitive to extreme values.

Trimmed Sample Mean
Definition 1.10. We define the ∆% trimmed mean as follows: First take k = ⌊n∆/100⌋, where ⌊·⌋ denotes the integer part. Remove the k smallest and the k largest values of the sample. Calculate the sample mean of the remaining values.
This is harder to calculate, but less sensitive to extreme values.

Section 1.5: Numerical measures of range or spread of data
Definition 1.11. The sample variance is defined as
s² = Σ_{j=1}^n (x_j − x̄)² / (n − 1).
Equivalently (easier to calculate – check they are equivalent!!)
s² = (Σ_{j=1}^n x_j² − (Σ_{j=1}^n x_j)²/n) / (n − 1) = (Σ_{j=1}^n x_j² − n x̄²) / (n − 1).
NB: Please note the normalization of the second term in the numerator.
Sample standard deviation s = √(s²). s² represents spread: s² large means large spread around x̄, small s² means small spread.
Logic for dividing by (n − 1): the sum of the (x_j − x̄) is zero, so there are only (n − 1) independent values of (x_j − x̄).

Hinges and quartiles
Definition 1.12. We want to divide the sample into 4 roughly equal parts. The lower hinge H1 is the median of the set {data values with rank ≤ rank of sample median}. The upper hinge H3 is the median of the set {data values with rank ≥ rank of sample median}.
Definition 1.13. A related idea is that of quartiles: definitions vary slightly. Rice defines Q1 to be the data value with rank (n + 1)/4 and Q3 to be the data value with rank 3(n + 1)/4. That is, Q1 = x((n+1)/4) etc. If these ranks aren't integers, we will interpolate. Different authors/packages interpolate in different ways (see sheet). For large samples, H1 ≈ Q1 and H3 ≈ Q3.

IQR, outliers, skewness
Definition 1.14. A five number summary can be calculated from the data – give the median, the upper and lower hinges and the maximum and minimum. These numbers roughly divide the data into four equally-sized groups.
Definition 1.15. Interquartile range: IQR = Q3 − Q1 measures spread around the median. In this course, we define outliers to be points more than 1.5 × (H3 − H1) ≈ 1.5 × IQR away from the hinges. We say a distribution is skewed to the right if the histogram has a long right tail, that is if H3 − median > median − H1. The distribution is skewed to the left if H3 − median < median − H1.

Example calculations
Example 1.16. If the data is seen in the order 3, 1, 5, 9, 0, −1, then x1 = 3, x2 = 1, x3 = 5, x4 = 9, x5 = 0, x6 = −1, and x(1) = −1, x(2) = 0, x(3) = 1, x(4) = 3, x(5) = 5, x(6) = 9.
n = 6, so the median is (x(3) + x(4))/2 = (1 + 3)/2 = 2. x̄ = (−1 + . . . + 9)/6 = 17/6 ≈ 2.83.
The 20% trimmed mean is found by taking k = ⌊6 × 20/100⌋ = ⌊1.2⌋ = 1. Hence, discarding the 1 largest and 1 smallest value gives a 20% trimmed mean of (0 + 1 + 3 + 5)/4 = 2.25.
Since Σ_j x_j² = 3² + 1² + 5² + 9² + 0² + (−1)² = 117, we get s² = (117 − 17²/6)/(6 − 1) = (117 − 289/6)/5 ≈ 13.77.
H1 = 0, the median of {−1, 0, 1}. H3 = 5, the median of {3, 5, 9}.
(n + 1)/4 = 7/4, so take Q1 = x(1) + 3/4(x(2) − x(1)) = −1 + 3/4(0 − (−1)) = −1/4. 3(n + 1)/4 = 21/4, so take Q3 = x(5) + 1/4(x(6) − x(5)) = 6.

Section 1.6: Initial Graphical Plots
In graphical plots, you should check:
the overall pattern of variation within the data (e.g. symmetric, skew, bi-modal)
any unusual features within a pattern or striking deviations from a pattern (outliers)
whether any features are just random occurrences or are systematic features
any evidence of rounding or granularity (data clumping at certain sequences of values, reflecting the measurement scale)
For more than one variable, first examine each variable by itself, then study relationships between the variables. For data from more than one population, compare variation within each data set with variation between data sets – use numerical (e.g. summary statistics) or graphical (e.g. boxplots) summaries.
Best to generate graphical plots using R – a powerful language for statistics (and other applications). Typing data() will give a list of available datasets. The course dataset is online.

EDA (a): Stem-and-leaf plots in R – command stem
Example 1.17.
> load(url("https://people.maths.bris.ac.uk/~maotj/teach/stats1.RData"))
> sort(quakes)
[1] 9 30 33 36 38 40 40 44 46 76 82 83 92
[14] 99 121 129 139 145 150 157 194 203 209 220 246 263
[27] 280 294 304 319 328 335 365 375 384 402 434 436 454
[40] 460 556 562 567 584 599 638 667 695 710 721 735 736
[53] 759 780 832 840 887 937 1336 1354 1617 1901
> stem(quakes)
The decimal point is 2 digit(s) to the right of the |
0 | 133444445888902345569
2 | 01256890234788
4 | 034566678
6 | 0470124468
8 | 3494
10 |
12 | 45
14 |
16 | 2
18 | 0

Formal description of Stem-and-Leaf Plot
1 If necessary, truncate or round the data values so that all the variation is in the last two or three significant digits.
2 Separate each data value into a stem (consisting of all the digits except the rightmost), and a leaf (the rightmost digit).
3 Write the stems in a vertical column – smallest at the top – and draw a separator (e.g. a vertical line) to the right of this column.
4 Write each leaf in the row to the right of the corresponding stem, in increasing order out from the stem.
5 Record any strikingly low or high values separately from the main stem, displaying the individual values in a group above the main stem (low values) or below it (high values).

Earthquakes stem-and-leaf plot example (cont.)
Example 1.17. R decided to put the decimal point 2 digits to the right of the | (the bar) and use a scale where each stem corresponds to an interval of 200 days – and hence each leaf corresponds to an interval of 10 days. Thus the first line represents (rounded) data values of 10, 30, 30, 40, 40, . . . , 90, 100, 120, . . . , 190 in that order. Because of the coarse scale and small number of data values, it can be difficult to tell that the last line, for example, represents the data value 1900 rather than 1800. Change the scale on which the data is displayed using e.g.
stem(quakes, scale=2), which produces a scale where each stem corresponds to 100 days.

Earthquakes stem-and-leaf plot example (cont.)
Example 1.17.
> stem(quakes, scale=2)
The decimal point is 2 digit(s) to the right of the |
0 | 1334444458889
1 | 02345569
2 | 0125689
3 | 0234788
4 | 03456
5 | 6678
...
18 |
19 | 0

EDA (b): Histograms in R – command hist
Example 1.18. Produce a histogram of the earthquake times using hist(quakes). The standard histogram is on the left, a customised version on the right; the side-by-side layout is done using par(mfrow=c(1,2)).
[Figure: left, "Histogram of quakes" (Frequency against quakes); right, "Earthquakes - maotj" (Density against time between earthquakes in days).]

Formal description of histogram
1 Divide the range of data values into K intervals (cells or bins) of equal width. If the width is too large, the plot may be too coarse to see the details of any pattern; if too small, many cells may have just one or two observations.
2 Count the number (frequency) or the percentage of observations falling into each interval. Be consistent with the allocation of values that equal the end points of intervals.
3 Display a plot of joined columns or bars above each interval, with height proportional to the frequency or percentage for that interval.

Customising histograms in R
Example 1.18. The plots can be customised using sub-commands, such as:
freq=FALSE to display densities (i.e. proportions) rather than counts;
specifying breaks to give a certain number of cells, or to give cells of a desired width;
adding titles using main="Plot name - Your ID";
labelling axes using xlab="Label for X axis", similarly ylab.
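The three-step histogram recipe above (equal-width cells, consistent counting, bars) can be mimicked outside R. Here is a minimal Python sketch; the toy data and break points are my own illustrative choices, not the quakes data, and the left-closed endpoint rule is one consistent convention among several:

```python
def histogram_counts(data, breaks):
    """Count observations per cell [breaks[i], breaks[i+1]); the last cell
    also includes its right end point (one consistent allocation rule)."""
    counts = [0] * (len(breaks) - 1)
    for x in data:
        for i in range(len(breaks) - 1):
            in_cell = breaks[i] <= x < breaks[i + 1]
            if in_cell or (i == len(breaks) - 2 and x == breaks[-1]):
                counts[i] += 1
                break
    return counts

data = [3, 1, 5, 9, 0, -1]      # toy data (the values of Example 1.16)
breaks = [-2, 2, 6, 10]         # three cells of equal width 4
print(histogram_counts(data, breaks))   # [3, 2, 1]
```

Dividing each count by n × (cell width) would give the density scale that freq=FALSE produces in R.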
ALWAYS ADD YOUR OWN ID before printing a plot. For example we can customise the histogram above as follows:
> hist(quakes, breaks=seq(0,2000,100), freq=FALSE,
+ xlab="Time between earthquakes in days", main="Earthquakes - maotj")
This gives a histogram of proportions rather than frequencies; the breaks between cells form a sequence starting at 0 days, finishing at 2000 days, and 100 days apart; it adds a label to the x-axis; and it revises the title of the plot.

EDA (c): Time Plots in R – command plot
A plot of the data in the order it was recorded may give valuable information when the values represent outcomes of repetitions of a single statistical experiment repeated over time. In R this is obtained with the command plot, e.g.
> load(url("https://people.maths.bris.ac.uk/~maotj/teach/stats1.RData"))
> plot(quakes)

EDA (d): Boxplot and five number summary
A boxplot is a very simple graphical summary of a data set, devised by John Tukey, based on the five number summary (see Definition 1.14). The plot consists of a box with
the top drawn level with the value of the upper hinge,
the bottom drawn level with the value of the lower hinge,
a horizontal line drawn across the middle, level with the median.
Vertical lines (sometimes called whiskers) are drawn from the top of the box to a point level with the maximum value, and from the bottom of the box to a point level with the minimum value. If there are any outliers, then the whiskers are drawn to the largest data value within 1.5 × IQR of the corresponding hinge and the remaining outlier data points are plotted individually.

Creating boxplots in R
In R a boxplot and the corresponding five numbers on which it is based can be obtained with the commands boxplot and fivenum.
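As a cross-check on Definitions 1.12 and 1.14, the five number summary can also be computed by hand. The Python sketch below uses the hinge-rank rule that R's fivenum follows (that choice of interpolation is an assumption about conventions; the function names are my own), applied to the data of Example 1.16:

```python
import math

def five_number_summary(data):
    """Min, lower hinge, median, upper hinge, max (Definition 1.14)."""
    x = sorted(data)
    n = len(x)
    h = math.floor((n + 3) / 2) / 2            # hinge rank (fivenum-style)
    ranks = [1, h, (n + 1) / 2, n + 1 - h, n]  # 1-based, possibly fractional
    # a fractional rank averages the two neighbouring order statistics
    return [0.5 * (x[math.floor(r) - 1] + x[math.ceil(r) - 1]) for r in ranks]

data = [3, 1, 5, 9, 0, -1]                     # the data of Example 1.16
print(five_number_summary(data))               # [-1.0, 0.0, 2.0, 5.0, 9.0]

# the same data reproduces the other summaries worked in Example 1.16
x = sorted(data)
trimmed_20 = sum(x[1:-1]) / 4                  # 20% trimmed mean: 2.25
mean = sum(data) / len(data)                   # 17/6
s2 = sum((v - mean) ** 2 for v in data) / 5    # sample variance, about 13.77
```

Note that the Rice quartiles of Definition 1.13 interpolate differently (giving Q1 = −1/4 and Q3 = 6 here), which is exactly why different packages can disagree on quartile values.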
For example, for the dataset quakes:
> boxplot(quakes)
> fivenum(quakes)
[Figure: a typical boxplot, annotated from top to bottom with: individual values of outliers; largest value within 1.5 IQR of upper hinge; upper hinge; median; lower hinge; smallest value within 1.5 IQR of lower hinge; individual values of outliers.]

Section 1.7: Newcomb data example
Example 1.19. The Newcomb data sample refers to attempts to measure the speed of light. The data values record each time in terms of deviations from a standard time of 0.000024800 seconds – so a value of 28 indicates a recorded time of 0.000024828 seconds etc.
> load(url("https://people.maths.bris.ac.uk/~maotj/teach/stats1.RData"))
> newcomb
[1]  28  26  33  24  34 -44  27  16  40  -2  29  22  24  21  25  30  23  29  31
[20] 19  24  20  36  32  36  28  25  21  28  29  37  25  28  26  30  32  36  26
[39] 30  22  36  23  27  27  28  27  31  27  26  33  26  32  32  24  39  28  24
[58] 25  32  25  29  27  28  29  16  23
> plot(newcomb, main="Plot of Newcomb data - maotj")
> hist(newcomb, breaks=seq(-45,45,2.5),
+ main="Histogram of Newcomb data - maotj")

Example 1.19.
[Figure: top, "Plot of Newcomb data - maotj" (newcomb against Index); bottom, "Histogram of Newcomb data - maotj" (Frequency against newcomb).]

Numerical summaries
Example 1.19. We can use the R commands median(), mean(), summary(), var(), sd() and IQR() to calculate numerical summaries.
> median(newcomb)
[1] 27
> mean(newcomb)
[1] 26.21212
> mean(newcomb, trim = 0.1)
[1] 27.42593
> mean(newcomb, trim = 0.2)
[1] 27.35
> summary(newcomb)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-44.00 24.00 27.00 26.21 30.75 40.00
> var(newcomb)
[1] 115.462
> sd(newcomb)
[1] 10.74532
> IQR(newcomb)
[1] 6.75

Graphical summary
Example 1.19. We can also produce a graphical summary of the data with the command boxplot(), for which the relevant numerical values are given by the command fivenum().
> boxplot(newcomb, main="Boxplot of Newcomb data - maotj")
> fivenum(newcomb)
[1] -44 24 27 31 40
The boxplot scale can be distorted by the presence of outliers. We can produce a boxplot of the newcomb data set which excludes the outliers (the 6th and 10th data values) with the command boxplot(newcomb[-c(6,10)]). The standard boxplot is shown on the left below, while a boxplot without the outliers is shown on the right.

Boxplot with and without outliers
Example 1.19.
[Figure: two boxplots titled "Boxplot of Newcomb data - maotj", with the outliers (left) and without them (right).]

Section 2: Parametric families and method of moments
Aims of this section: This section introduces the idea of modelling the distribution of a variable in a population in terms of a family of parametric distributions, where the parameters of the distribution relate to specific quantities of interest in the population, such as the population mean or variance. If our model is appropriate for the data, then we can make inferences about the population from which the data was obtained simply by estimating the parameters of the distribution. The section starts by introducing one of the simplest methods of parametric estimation – the method of moments – but note that other methods (maximum likelihood methods and least squares methods) will be introduced in later sections.

Aims of this section: For discrete observations representing counts, the correct choice of parametric model is often determined by the context.
However, for continuous uni-modal data, the correct choice of model can be more difficult. There are often a number of feasible models whose distributions appear at first glance to be similar while differing in essential details. In the second half of the section we introduce probability plots (plots of the quantiles of the data against the quantiles of the fitted distribution) as a way of assessing the fit of the data to the chosen parametric family.
Suggested reading: Rice Sections 8.1-8.4, Section 10.2.1, Section 9.9

Objectives: by the end of this section you should be able to
Understand how parametric families can be used to provide flexible models for the distribution of population random variables.
Recall the formulae for the probability mass function and the probability density function for the standard parametric families of discrete and continuous distributions (Binomial, Geometric, Poisson, Exponential, Gamma, Normal, Uniform) and be able to relate the parameters of each distribution to population quantities such as the population mean and the population variance.
Write down equation(s) defining the method of moments estimators for parametric families defined in terms of one or more unknown parameters.
Solve the equations for the method of moments estimators in simple one and two parameter cases, and hence compute appropriate method of moments estimates from a given data set.
Use R to numerically calculate the quantiles of simple standard distributions (Uniform, Exponential, Gamma, Normal).
Construct simple probability plots of the quantiles of a data set against the quantiles of a given or fitted distribution (by hand or in R), and use the plot to assess the fit of the data to the specified distribution.
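The quantile calculations behind the last two objectives are easy to do directly for the Exponential family, where the inverse CDF has a closed form. A hedged Python sketch (the value of θ and the k/(n+1) plotting positions are my own illustrative choices):

```python
import math

def exp_quantile(p, theta):
    """Inverse CDF of Exp(theta): solve p = 1 - exp(-theta * x) for x."""
    return -math.log(1.0 - p) / theta

theta = 0.5                  # an illustrative rate parameter
n = 4
# theoretical quantiles at plotting positions p = k/(n+1), k = 1, ..., n
qs = [exp_quantile(k / (n + 1), theta) for k in range(1, n + 1)]
# in a probability plot these would be paired with the order statistics
# x(1) <= ... <= x(n) of the data; points near a straight line suggest
# a good fit to the assumed family
print([round(q, 3) for q in qs])
```

In R the same numbers come from qexp(p, rate); the sketch just makes the underlying inversion explicit.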
Section 2.1: Parametric Models
Definition 2.1. When X is a continuous variable and the population size is large, some probability density function fX(x) may give a reasonable, if idealised, model of the frequency of X values in the population. Call X the population random variable and call fX(x) the population probability density function (pdf). Similarly call pX(x) the population probability mass function (pmf) if X is discrete.
Although we do not know the population distribution, we often have theory, experience or data suggesting a certain family of probability density functions is appropriate for the population in question.
Example 2.2. Inspection of the data may lead us to believe the earthquake times in Section 1.6 come from an Exponential(θ) distribution and the Newcomb times in Section 1.7 come from a Normal(µ, σ²) distribution. We would like to estimate the population parameters θ, µ and σ².

Definition of parametric families
Definition 2.3. A parametric family is a collection of distributions of the same type which differ only in the value of one (or more) parameter, say θ. In other words, the form of the distribution is a function of θ. Write fX(x; θ) for the pdf in the parametric family corresponding to the parameter θ. Write E(X; θ) for the mean of the corresponding distribution etc.
A summary sheet of parametric families and graphs comparing probability density functions is provided in the appendix.
Discrete: Bernoulli(θ), Binomial(K, θ), Geometric(θ), Poisson(θ).
Continuous: U(0, θ), Exp(θ), Γ(α, β), N(µ, σ²).
Other parametric families such as the Lognormal, the Pareto and the Weibull families can, for example, provide better models of skewed data populations.

Estimation
Definition 2.4. An estimate θ̂(x1 , . . .
, xn) (usually abbreviated to θ̂) is our 'best guess' for the unknown parameter, based on the data {x1, . . . , xn}. We sometimes refer to the function (x1, . . . , xn) → θ̂ as the 'estimator'.
Remark 2.5. θ is the true, fixed but unknown value. We hope that θ̂ will be close to θ in some sense. However θ̂ depends on the data, so is itself random. IT IS VITAL TO USE THE NOTATION θ AND θ̂ CORRECTLY: THEY ARE NOT INTERCHANGEABLE! We will distinguish between methods of estimating (method of moments, maximum likelihood, least squares) with subscripts if necessary.

Estimating quantities of interest
Remark 2.6. Although the original 'quantity of interest' will not necessarily correspond to θ itself, it must be a function of θ, say τ(θ). One way to estimate τ(θ) is to first estimate θ, then to plug the estimate θ̂ into the expression τ(θ) to get the estimate τ(θ̂). See Example 3.10 below for more details.

Section 2.2: Sampling from parametric families
From now onwards, at the start of each problem we will assume that an appropriate parametric family has been identified. The family has probability density function f(x; θ) (or probability mass function p(x; θ) if discrete). It can have a single unknown parameter θ, or in general k unknown parameters.
Definition 2.7. We assume the data values x1, . . . , xn are the observed values of a simple random sample of size n from the population represented by the parametric family, i.e. we assume X1, . . . , Xn independent, identically distributed ∼ f(x; θ).
For simple random samples, Remark 1.5 explains that the data values are representative of the values in the population as a whole. Thus we can use data values from the (possibly small) sample to make inferences about the values in the population as a whole.
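Definition 2.7 is exactly the setting that is easy to simulate: draw an IID sample from a known family, then see whether a sample-based estimate recovers the true parameter. A hedged Python sketch using the Exp(θ) estimator θ̂ = 1/x̄ derived in Section 2.3 (the true θ, the sample size and the seed are my own choices):

```python
import random

random.seed(1)                      # fixed seed so the sketch is reproducible
theta = 2.0                         # true rate of the Exp(theta) population
n = 10_000                          # sample size (my choice)
# a simple random sample: n IID draws, each with population mean 1/theta
sample = [random.expovariate(theta) for _ in range(n)]
m1 = sum(sample) / n                # first sample moment
theta_hat = 1.0 / m1                # method of moments: solve E(X) = m1
print(round(theta_hat, 3))          # lands close to the true theta = 2.0
```

This is the content of Remark 1.5 in miniature: because the draws are IID, the sample moment tracks the population moment, so the estimator tracks θ.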
Joint probability density of a simple random sample
Lemma 2.8. For a simple random sample from a distribution in a parametric family,
fX1,...,Xn(x1, . . . , xn; θ) = ∏_{i=1}^n fX(xi; θ).
Proof. Since X1, . . . , Xn are independent, their joint probability density function factorises as the product of the marginal density functions:
fX1,...,Xn(x1, . . . , xn; θ) = fX1(x1; θ) fX2(x2; θ) · · · fXn(xn; θ).
X1, . . . , Xn are identically distributed with the same distribution as X, so each marginal density function has the same form as the density function for X, i.e. for all i: fXi(xi; θ) = fX(xi; θ).

Section 2.3: Method of moments estimation
Definition 2.9. Assume the random variable X comes from a parametric family with parameter θ. For k = 1, 2, 3, . . . we call E(X^k; θ) the kth population moment (i.e. the average value of X^k in the population). Hence E(X; θ) is the first population moment etc. We can look up E(X; θ) in the Appendix or calculate it as ∫_{−∞}^{∞} x^k f(x; θ) dx.
Definition 2.10. For a sample with data values {x1, . . . , xn}, for k = 1, 2, 3, . . . define the kth sample moment by
mk = (x1^k + x2^k + . . . + xn^k)/n.
Hence mk is the average value of x^k in the sample. Hence m1 = (x1 + . . . + xn)/n is the sample mean etc.

Method of moments estimation: motivation
If the data comes from a simple random sample, the sample values are representative of the population values. This suggests population moment ≈ sample moment.
Definition 2.11. Given a sample {x1, . . . , xn} from a parametric family:
If the family has one parameter θ: define the method of moments estimator θ̂mom to be the solution of E(X; θ̂mom) = m1.
If the family has two parameters α, β: define the method of moments estimators α̂mom and β̂mom to be the (simultaneous) solutions of
E(X; α̂mom, β̂mom) = m1
E(X²; α̂mom, β̂mom) = m2
If there are k unknown parameters, compare the first k population and sample moments.

Section 2.4: Method of moments examples
Example 2.12. Assume {x1, . . . , xn} come from a simple random sample from an Exponential(θ) distribution, with θ unknown. One unknown parameter, so we need one equation, involving the mean. The factsheet tells us that E(X; θ) = 1/θ. The method of moments implies that m1 = E(X; θ̂mom) = 1/θ̂mom. Rearranging, θ̂mom = 1/m1 = 1/x̄ = n/(x1 + . . . + xn).
Example 2.13. For the earthquake data from Section 1.6, the sample histogram 'looks exponential' – no reason to think Exp(θ) is not an adequate model. (See Section 2.5 below for a more rigorous way of doing this.) In this case n = 62 and the sample mean m1 = x̄ ≈ 437.2097 (found using mean(quakes) in R). Hence the method of moments estimator θ̂mom = 1/m1 ≈ 0.002287.

Method of moments examples
Example 2.14. Assume {x1, . . . , xn} come from a simple random sample from an N(µ, σ²) distribution, with µ, σ² unknown. Two unknown parameters, so we need two equations, involving the mean and variance. The factsheet tells us that E(X; µ, σ²) = µ. Further, Var(X; µ, σ²) = σ², so that E(X²; µ, σ²) = µ² + σ². The method of moments estimates are joint solutions of the two equations:
µ̂mom = E(X; µ̂mom, σ̂²mom) = m1
µ̂²mom + σ̂²mom = E(X²; µ̂mom, σ̂²mom) = m2
The first equation gives µ̂mom = m1 = x̄. Further, σ̂²mom = m2 − µ̂²mom = m2 − m1². Rearranging gives σ̂²mom = Σ_{j=1}^n (xj − x̄)²/n.

Newcomb data example
Example 2.15.
For the Newcomb data from Section 1.6, after removing the outliers x_6 and x_10, the sample histogram gives no reason to think N(μ, σ²) is not an adequate model.
Using R (since var gives the sample variance, i.e. divides by (n − 1) = 63):
> mean(newcomb[-c(6,10)])
[1] 27.75
> 63/64*var(newcomb[-c(6,10)])
[1] 25.4375
Hence μ̂_mom = 27.75 and σ̂²_mom = 25.4375.
Equivalently, taking square roots, σ̂_mom = √(σ̂²_mom) = 5.04356.
Obviously we would get different answers if we didn't remove the outliers.

Method of moments examples

Example 2.16. Assume {x_1, ..., x_n} come from a simple random sample from a Uniform(0, θ) distribution.
Since for X ∼ U(0, θ) the mean is E(X; θ) = θ/2, we solve θ̂_mom/2 = E(X; θ̂_mom) = x̄. In this case θ̂_mom = 2x̄.
This may not make sense if max_i(x_i) > θ̂_mom = 2x̄.

Section 2.5: Assessing Fit

Say we have
- a data set of n values x_1, ..., x_n assumed to be a random sample from a population whose distribution function and pdf have parametric forms F_X(x; θ) and f_X(x; θ);
- an estimate θ̂ = θ̂(x_1, ..., x_n) calculated from the data.
It is good practice to assess how well our model fits the data, by comparing the observations x_1, ..., x_n with the values we might expect for a random sample from F_X(x; θ̂).
If the observations show striking or systematic differences from what we would expect, our assumed model may not be appropriate.

Heuristics based on order statistics

If the model F_X(x; θ̂) is correct, for any i and y, the probability P(X_i ≤ y) = F_X(y; θ̂).
This means #{X_i ≤ y} ∼ Bin(n, F_X(y; θ̂)) ≈ n F_X(y; θ̂).
Equivalently, if we write k = n F_X(y; θ̂), there are about k values less than y, so the kth order statistic
x_(k) ≈ y = F_X^{-1}(k/n; θ̂).
Here F_X^{-1} denotes the inverse function of F_X for fixed θ (not the reciprocal 1/F_X). That is, F_X^{-1}(F_X(y; θ); θ) = y.
In fact, for k = 1, . . .
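The 63/64 adjustment in Example 2.15 works because σ̂²_mom = m_2 − m_1² is exactly the n-divisor variance. A minimal sketch with illustrative numbers (not the Newcomb data) checks this identity:

```python
# Hypothetical measurements (illustrative only).
data = [26.0, 28.0, 30.0, 27.0, 29.0]
n = len(data)

m1 = sum(data) / n                   # first sample moment
m2 = sum(x * x for x in data) / n    # second sample moment

mu_mom = m1
sigma2_mom = m2 - m1 ** 2            # method of moments variance estimate

# Cross-check against the direct n-divisor formula sum((x - xbar)^2)/n
direct = sum((x - m1) ** 2 for x in data) / n
print(mu_mom, sigma2_mom, direct)
```

This is the quantity R's var would give after multiplying by (n − 1)/n.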
. . . , n, in practice (to avoid issues with k = n) we use:
x_(k) ≈ F_X^{-1}(k/(n + 1); θ̂),
or equivalently F_X(x_(k); θ̂) ≈ k/(n + 1).

Assessing fit with order statistics

If the model F_X(x; θ̂) is correct, then – on average – the observations are likely to be equally spaced out according to this distribution.
In other words, the n sample values should roughly split the range of X values into n + 1 intervals, each of which has probability 1/(n + 1).
[Figure: expected values of the order statistics for a simple random sample of size n = 4, for two example distributions. The values split each range into 5 intervals, each with probability 1/5.]

Section 2.6: Quantile (Q-Q) plots and Probability plots

For a given value of n, and a given distribution F_X(x), we will call the n values
F_X^{-1}(k/(n + 1)), k = 1, ..., n,
the quantiles of the distribution. Similarly, the n ordered sample values x_(1), ..., x_(n) that split the sample into roughly equal parts are called the sample quantiles.
The discussion above leads to two simple graphical methods for assessing the fit of a model: quantile (or Q-Q) plots and probability plots. (Some authors use the term 'probability plots' for both methods.)

Quantile plot

For this you must have an analytic or numerical method for computing values of F_X^{-1}(x; θ) (e.g. using R). The procedure is as follows:
1. Compute an estimate θ̂ for θ (e.g. the method of moments estimate).
2. Order the observations to obtain the order statistics (i.e. the sample quantiles) x_(1), ..., x_(n).
3. For k = 1, ..., n, compute F_X^{-1}(k/(n + 1); θ̂). These are the expected quantile values (i.e. the values we would expect for the quantiles if the model was correct).
4. For k = 1, . . .
. . . , n, plot the pairs (F_X^{-1}(k/(n + 1); θ̂), x_(k)).
5. Add the line y = x to the plot (i.e. the line through the origin with slope 1).

Analysing a Q-Q plot

If the points show only small, random deviations from the line, then there is no reason to reject the model.
If there are striking or systematic deviations from the line, then this may be evidence that the model is not appropriate.
An alternative plot is a probability plot: this proceeds in a similar way to a quantile plot, except that you now need to be able to compute values of F_X(x; θ), and you plot the values of the sample probabilities F_X(x_(k); θ̂) against the values k/(n + 1), k = 1, ..., n.

Quantile and Probability plots in R

These plots are easy to produce in R for standard distribution families.
Consider a family called name with probability density function f(x; θ) and distribution function F(x; θ) which depend on a parameter θ. Then, for given numerical values of x and θ, we can use the R functions
- dname(x, θ) – which returns the value of the density f(x; θ)
- pname(x, θ) – which returns the value of the probability F(x; θ) = P(X ≤ x; θ)
- qname(x, θ) – which returns the value of the quantile F^{-1}(x; θ)
For more information on exactly what parameters need to be specified for each distribution, use the help facility in R – for example, try typing help(dexp), help(dunif), help(dgamma) or help(dnorm).

Section 2.7: Earthquakes Example

Example 2.17. The quakes data set records the times between successive serious earthquakes.
The suggested model is an Exponential distribution with parameter θ.
Write θ̂ for the method of moments estimate θ̂_mom. We have seen in Example 2.12 that for this model θ̂ = 1/m_1 = 1/x̄.
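For the Exp(θ) model, the expected quantiles in step 3 have a closed form, F_X^{-1}(p; θ) = −log(1 − p)/θ, which is what qexp computes in R. A small Python sketch (with an assumed rate and a made-up ordered sample, both illustrative) builds the pairs that would be plotted in a quantile plot:

```python
import math

theta = 0.5                          # assumed fitted rate (illustrative)
sample = [0.4, 1.1, 2.3, 3.0, 6.2]   # hypothetical ordered sample, n = 5
n = len(sample)

# Expected quantiles F^{-1}(k/(n+1); theta) for Exp(theta): -log(1 - p)/theta
expected = [-math.log(1 - k / (n + 1)) / theta for k in range(1, n + 1)]

# The pairs (expected quantile, observed order statistic) plotted in step 4
pairs = list(zip(expected, sample))
print(pairs)
```

If the exponential model fits, these points should scatter around the line y = x.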
Remember that to access the data you may first need to type
load(url("https://people.maths.bris.ac.uk/∼maotj/teach/stats1.RData")).

Q-Q and probability plots for Earthquake data

[Figure: two panels – "Quakes quantile plot" (sample quantiles against quantiles of the fitted distribution) and "Quakes probability plot" (sample probabilities against probabilities of the fitted distribution), each with the line y = x.]
Although the points do not lie exactly on a straight line, there does not appear to be any significant systematic deviation from the line y = x, and no substantial reason to reject the Exponential model.

Q-Q and probability plots in R for the Exp(θ) model

The following R commands compute a quantile plot:
> m1 <- mean(quakes)
> theta <- 1/m1
> quakes.ord <- sort(quakes)
> quant <- seq(1:62)/63
> quakes.fit <- qexp(quant,theta)
> plot(quakes.fit, quakes.ord, ylab="Sample quantiles", xlab="Quantiles of fitted distribution", main="Title - id", abline(0,1))
The probability plot is produced by
> quakes.pfit <- pexp(quakes.ord,theta)
> plot(quakes.pfit, quant, abline(0, 1))

Section 2.8: Interpreting Quantile Plots

The plots below illustrate ways in which the sample data may differ systematically from the predictions of the fitted model, for a sample of size n = 1000. The sample:
(a) comes from the fitted model; we see the points lying fairly well along the line.
(b) is from a distribution with longer left and right tails than the fitted model.
The sample quantiles at each end are much more spread out than one would expect if the model was correct, so they are smaller at the left-hand end and larger at the right-hand end than the corresponding expected quantiles for the fitted distribution.
(c) is from a distribution with shorter left and right tails than the fitted model. The sample quantiles at each end are much less spread out than one would expect if the model was correct.
(d) is from a distribution which corresponds to a random variable which is a location/scale mapping (X ↦ aX + b) of that specified by the fitted model. This does not affect the fit to a straight line, but it does affect the slope and intercept of the line of fit.
[Figure: four quantile plots of sample quantiles against expected quantiles – (a) observations fit the model distribution, (b) observations have longer tails, (c) observations have shorter tails, (d) observations fit a linear transformation of the model.]

Section 3: Likelihood and Maximum Likelihood Estimation

Aims of this section:
In this section we introduce the concepts of the likelihood function and the maximum likelihood estimate.
For a distribution in a given parametric family, the likelihood function acts as a summary of all the information about the unknown parameter contained in the observations.
Many important and powerful statistical procedures have the likelihood function as their starting point. Here we focus on the method of maximum likelihood estimation, which could be said to provide the most plausible estimate of the unknown parameter for the given data.
Suggested reading: Rice, Section 8.5.

Objectives: by the end of this section you should be able to
- Write down the general form of the likelihood function and the log-likelihood function, based on a simple random sample from a distribution in a general parametric family.
- Define the maximum likelihood estimate(s) for one or more unknown parameters, based on a simple random sample from a distribution in a general parametric family.
- Write down the general form of the likelihood equation(s), based on a simple random sample from a distribution in a general parametric family, and understand how and when the maximum likelihood estimate(s) can be obtained from the solution of the likelihood equation(s).
- Find the explicit form of the likelihood equation(s) for a simple one- or two-parameter family of distributions, and solve to find the maximum likelihood estimate(s) of the unknown parameter(s).
- Find the maximum likelihood estimate directly from the likelihood function for a simple one-parameter range family.
- Compute the maximum likelihood estimate for population quantities which are simple functions of the unknown parameter(s). Examples of such functions include the population mean, the population median and the population variance, or appropriate population probabilities.

Section 3.1: Motivation

Example 3.1. Consider a coin for which P(Head) = θ and P(Tail) = (1 − θ), where θ is an unknown parameter in (0, 1) which we wish to estimate.
One way to gain information about θ is to repeatedly toss the coin and count the number of tosses until we get a Head.
Assume the outcome of each toss is independent of all the other tosses, and let X denote the number of the toss on which we first get a Head. Then X ∼ Geom(θ) and
p(x; θ) = (1 − θ)^(x−1) θ,  x = 1, 2, 3, ...;  θ ∈ (0, 1).  (3.1)

Example 3.1. Say we perform the experiment once and get a single observation x = 4 (so the first head was observed on the fourth toss).
Write L(θ) [or more properly L(θ; x)] for the probability of getting this particular observation x, as a function of the unknown parameter θ.
Thus in this case L(θ) is got by putting x = 4 in Equation (3.1), giving
L(θ) = p(4; θ) = (1 − θ)³ θ,  θ ∈ (0, 1).
We call L(θ) the likelihood function for the given observation.
[Figure: graph of L(θ) against θ for θ ∈ (0, 1), with a single peak.]
The value of θ that maximises the likelihood function L(θ), i.e. that maximises the probability of getting this particular observation, is called the maximum likelihood estimate of θ.

Example 3.1. We can see that the maximum here is a turning point, so here the maximising value satisfies the equation
dL(θ)/dθ = 0,
where
dL(θ)/dθ = (1 − θ)³ − 3θ(1 − θ)² = (1 − θ)²(1 − 4θ),
and you can check that d²L/dθ² < 0, so the likelihood function is maximised at θ = 1/4.
Thus this single observation x = 4 has a greater likelihood of occurring when the parameter θ takes the value 0.25 than when θ takes any other possible value in (0, 1).
We call this value 0.25 the maximum likelihood estimate of θ for the single observation x = 4, and we denote it by θ̂_mle [or more properly θ̂_mle(x)] and write θ̂_mle = 0.25.

Example with multiple observations

Example 3.2. We can extend our analysis to the case of multiple observations.
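As a quick numerical check of Example 3.1 (in Python rather than the course's R), evaluating L(θ) = (1 − θ)³θ over a fine grid confirms that the maximiser is θ = 0.25:

```python
# Grid search over theta in (0, 1) for L(theta) = (1 - theta)^3 * theta
thetas = [i / 10000 for i in range(1, 10000)]
L = [(1 - t) ** 3 * t for t in thetas]
best = thetas[L.index(max(L))]
print(best)
```

The grid maximiser agrees with the analytic answer from the likelihood equation.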
For example, say we repeat the experiment three times and get three independent observations x_1 = 4, x_2 = 5, and x_3 = 1.
Then the corresponding random variables X_1, X_2, X_3 are independent, each with pmf p(x; θ), so they have joint probability mass function (see Lemma 2.8)
p_{X_1,X_2,X_3}(x_1, x_2, x_3; θ) = p(x_1; θ) p(x_2; θ) p(x_3; θ),
where the expression for p(x; θ) is given in Equation (3.1) above.

Example 3.2. In this case the likelihood function denotes the probability of observing these three numerical values x_1, x_2, x_3, as a function of the unknown parameter θ, and is given by
L(θ) ≡ L(θ; x_1, x_2, x_3) = p_{X_1,X_2,X_3}(x_1, x_2, x_3; θ)
= p(x_1; θ) p(x_2; θ) p(x_3; θ)
= (1 − θ)^(x_1−1) θ (1 − θ)^(x_2−1) θ (1 − θ)^(x_3−1) θ
= (1 − θ)³ θ (1 − θ)⁴ θ (1 − θ)⁰ θ
= (1 − θ)⁷ θ³.
As before, L(θ) is now maximised, at the value θ = 0.3.
Thus, as a function of the unknown parameter θ, the likelihood of these three numerical observations x_1 = 4, x_2 = 5, x_3 = 1 occurring is maximised by taking θ = 0.3.
Again, we say that 0.3 is the maximum likelihood estimate of θ and we write θ̂_mle(x_1, x_2, x_3) = 0.3 – or, since the context is clear, θ̂_mle = 0.3.

Section 3.2: Definition – Likelihood function

Definition 3.3. General case: Assume the data x_1, x_2, ..., x_n are the observed values of random variables X_1, X_2, ..., X_n, whose joint distribution depends on one or more unknown parameters θ. The likelihood function L(θ) ≡ L(θ; x_1, x_2, ..., x_n) is the joint probability mass function (discrete case) or joint probability density function (continuous case), regarded as a function of the unknown parameter θ for these fixed numerical values of x_1, x_2, ..., x_n.

Definition 3.4. For observed values {x_1, ..., x_n}, the maximum likelihood estimator (mle) θ̂_mle(x_1, ..., x_n) is the value of θ which maximises the likelihood function L(θ; x_1, ..., x_n).
Usually we just write θ̂_mle for the value that provides the most plausible overall explanation of the individual observations.

Random sample case

Example 3.5. Usual case: If X_1, X_2, ..., X_n is a random sample of size n from a distribution with probability mass function p(x; θ) (or probability density function f(x; θ)), then the X_i are i.i.d. and their joint distribution factorises into the product of marginals. Thus for a random sample
L(θ) = p(x_1; θ) p(x_2; θ) ··· p(x_n; θ)  (discrete case),
L(θ) = f(x_1; θ) f(x_2; θ) ··· f(x_n; θ)  (continuous case).
L(θ) is a function of θ for fixed data x_1, x_2, ..., x_n.
L(θ) gives a combined measure of how well the value θ explains the set of observations, and hence of the 'plausibility' of θ.
E.g. if the data are collectively unlikely as observations from f_X(x; θ) then L(θ) is small, and vice-versa.

Log-likelihood function

In practice, instead of maximising the likelihood, we usually maximise the log-likelihood ℓ(θ) := log L(θ), where log is the natural logarithm (and we take log 0 = −∞).
Since the logarithm is an increasing function, L(θ) and ℓ(θ) are maximised at the same value of θ. However, ℓ(θ) is often easier to deal with in practice.

Example 3.6. Let {x_1, ..., x_n} be a simple random sample from an Exponential(θ) distribution. Here (see handout) f(x; θ) = θ e^(−θx), for x > 0 and θ > 0, so
L(θ) = f(x_1; θ) f(x_2; θ) ··· f(x_n; θ) = (θ e^(−θx_1))(θ e^(−θx_2)) ··· (θ e^(−θx_n)) = θ^n exp(−θ(x_1 + ... + x_n)).
This means that ℓ(θ) = n log θ − θ Σ_{i=1}^n x_i = Σ_{i=1}^n (log θ − θ x_i).

Section 3.3: Finding the maximum likelihood estimate

The additive form of the log-likelihood function in Example 3.6 is not a coincidence.

Lemma 3.7. For observations taken from a simple random sample,
ℓ(θ) = Σ_{i=1}^n log f(x_i; θ).

Proof.
ℓ(θ) = log L(θ) = log(f(x_1; θ) ··· f(x_n; θ)) = log f(x_1; θ) + ... + log f(x_n; θ).

Log-likelihood function and likelihood equation

In 'regular' cases, the maximum of ℓ(θ) will be a stationary point. That is, the mle is the solution to the likelihood equation
∂ℓ(θ)/∂θ = 0.  (3.2)
Here ∂/∂θ denotes differentiation with respect to θ, keeping other variables fixed.
By 'regular' cases, we mean those such that f is a smooth function of θ with a range not depending on θ. This includes all distributions on the handout except the uniform (see Example 3.12 below).
Differentiating Lemma 3.7 we obtain ∂ℓ(θ)/∂θ = Σ_{i=1}^n ∂/∂θ log f(x_i; θ). Hence the likelihood Equation (3.2) becomes
0 = Σ_{i=1}^n ∂/∂θ log f(x_i; θ).

Procedure for calculating θ̂_mle

These observations suggest a recipe for finding the mle, when {x_1, ..., x_n} are observations from a simple random sample from a continuous regular distribution with density f(x; θ):
1. Calculate the function ∂/∂θ log f(x; θ).
2. Compute and simplify the sum ∂/∂θ log f(x_1; θ) + ... + ∂/∂θ log f(x_n; θ), which we will consider as a function of θ.
3. The mle θ̂_mle is the value of θ satisfying the likelihood equation 0 = Σ_{i=1}^n ∂/∂θ log f(x_i; θ).

Example: exponential data

Example 3.8. Return to the setting of Example 3.6. That is, assume {x_1, ..., x_n} come from a simple random sample from an Exponential(θ) distribution, with θ unknown.
Here f(x; θ) = θ exp(−θx), so that log f(x; θ) = log θ − θx.
Treating x as a constant, ∂/∂θ log f(x; θ) = 1/θ − x.
This means that
∂/∂θ log f(x_1; θ) + ... + ∂/∂θ log f(x_n; θ) = n/θ − Σ_{j=1}^n x_j.
Hence θ̂_mle solves the likelihood equation n/θ̂ − Σ_{j=1}^n x_j = 0.
That is, θ̂_mle = n/(Σ_{j=1}^n x_j) = 1/x̄.
In this case, this coincides with the θ̂_mom found in Example 2.12 – however, that is not true in general.
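Example 3.8 can be sanity-checked numerically: maximising ℓ(θ) = n log θ − θ Σx_i over a grid should land next to the analytic answer 1/x̄. A sketch with made-up data (Python rather than R):

```python
import math

data = [1.2, 0.7, 3.1, 0.4, 2.0]   # hypothetical exponential sample
n = len(data)
s = sum(data)

def loglik(theta):
    # log-likelihood for Exponential(theta): n log(theta) - theta * sum(x_i)
    return n * math.log(theta) - theta * s

thetas = [i / 1000 for i in range(1, 5000)]
theta_hat = max(thetas, key=loglik)
print(theta_hat, n / s)   # grid maximiser vs analytic 1/xbar
```

The two values agree to within the grid spacing.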
Section 3.4: Maximum likelihood estimate of τ(θ)

Lemma 3.9. If the quantity of interest is a function τ(θ) of θ, the mle of τ(θ) is simply the estimate τ̂ = τ(θ̂).
This is the invariance property of maximum likelihood estimation.
Proof: Not examinable – though the Lemma itself is examinable.
If we re-parameterize in terms of τ and solve ∂ℓ(τ)/∂τ = 0 for τ̂, we would get the same answer, at least when τ(θ) is one-to-one (injective).
Under the new parameterization, the likelihood satisfies ℓ^new(τ(θ)) = ℓ^old(θ). Note that by the chain rule applied to ℓ(τ(θ)),
∂/∂θ ℓ^old(θ) = ∂/∂θ ℓ^new(τ(θ)) = ∂/∂τ ℓ^new(τ(θ)) × τ′(θ),
so if τ′(θ) ≠ 0, then ∂/∂θ ℓ^new(τ(θ)) = 0 if and only if ∂/∂τ ℓ^new(τ) = 0.

Example 3.10. Again let x_1, x_2, ..., x_n be observed values of a simple random sample from the Exponential(θ) distribution with θ unknown. We found in Example 3.8 that θ̂_mle = 1/x̄.
Suppose we are interested in the population variance Var(X; θ) = 1/θ², i.e. in τ(θ) = 1/θ². Then the mle of the population variance is
τ̂ = τ(θ̂) = 1/θ̂² = x̄².
This is not the same as the sample variance!
Suppose we are interested in the proportion of the population taking values ≥ 1, that is, in τ(θ) = P(X ≥ 1; θ) = e^(−θ) for the Exp(θ) case. Then the mle of this proportion is
τ̂ = τ(θ̂) = e^(−θ̂) = exp(−1/x̄).
This is not the same as the sample proportion of values ≥ 1!

Section 3.5: Mle with multiple parameters

Mles can easily be extended to the case of multiple parameters. For example, for two parameters α, β, the α̂_mle and β̂_mle are the simultaneous solutions to the two likelihood equations
0 = Σ_{i=1}^n ∂/∂α log f(x_i; α, β) and 0 = Σ_{i=1}^n ∂/∂β log f(x_i; α, β).

Example 3.11. For example, consider a simple random sample from the N(μ, σ²) distribution with unknown mean and variance. This family is continuous and regular.
Since there are two parameters μ, σ, the μ̂_mle and σ̂_mle are the simultaneous solutions to the two likelihood equations. Since f(x; μ, σ) = (2πσ²)^(−1/2) exp(−(x − μ)²/(2σ²)),
log f(x; μ, σ) = −(1/2) log(2π) − log σ − (x − μ)²/(2σ²).

Mle with multiple parameters example

Example 3.11. This means that
∂/∂μ log f(x; μ, σ) = (x − μ)/σ²,
giving the first likelihood equation
0 = Σ_{i=1}^n (x_i − μ)/σ² = (Σ_{i=1}^n x_i − nμ)/σ²,
which we can solve to give μ̂_mle = x̄.

Example 3.11. Similarly
∂/∂σ log f(x; μ, σ) = −1/σ + (x − μ)²/σ³,
giving the second likelihood equation
0 = −n/σ + Σ_{i=1}^n (x_i − μ)²/σ³,
and substituting in μ = μ̂_mle = x̄, we obtain σ̂_mle = √(Σ_{i=1}^n (x_i − x̄)²/n).
Again, notice that these mles happen to coincide with the mom estimates of Example 2.14.

Section 3.6: Non-regular density

Recall that the likelihood equation (3.2) is based on the idea that the likelihood is maximised at a turning point, because the density is regular.
However, if the density is not regular, then the likelihood can be maximised at one of the endpoints of the interval. In this case, it is best to work directly with the likelihood function itself.

Non-regular density example

Example 3.12. Consider a simple random sample {x_1, ..., x_n} from the Uniform(0, θ) distribution.
In this case f(x; θ) = 1/θ if 0 ≤ x ≤ θ, and 0 otherwise. As a result
L(θ; x_1, ..., x_n) = f(x_1; θ) ··· f(x_n; θ) = 1/θ^n if θ ≥ x_1, ..., θ ≥ x_n, and 0 otherwise.
Hence, the likelihood is 1/θ^n (a decreasing function of θ) if θ ≥ x_(n) = max(x_1, ..., x_n), and zero otherwise.
This means (plot a graph?) that the likelihood is maximised at θ̂_mle = x_(n).
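Because Example 3.12 is non-regular, the maximum sits at the boundary θ = x_(n) rather than at a turning point. The sketch below (illustrative data, Python rather than R) evaluates the likelihood on a grid and confirms that the maximiser is the sample maximum:

```python
data = [0.2, 0.9, 0.4, 0.7]        # hypothetical sample from U(0, theta)
x_max = max(data)
n = len(data)

def lik(theta):
    # L(theta) = 1/theta^n if theta >= max(x_i), and 0 otherwise
    return theta ** (-n) if theta >= x_max else 0.0

thetas = [i / 1000 for i in range(1, 2001)]   # grid over (0, 2]
theta_hat = max(thetas, key=lik)
print(theta_hat, x_max)
```

The likelihood jumps from 0 to its maximum exactly at θ = x_(n) and decreases thereafter.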
Section 3.7: Example with not-identically-distributed observations

A strength of the maximum likelihood approach is that it still provides answers when the observations cannot be treated as identically distributed.
The likelihood is just the joint probability density or mass function of the data, regarded as a function of the parameters. This makes sense, and can be maximised with respect to the parameters, whatever the model.
Here we just consider one example, where the observations are still independent, but with different distributions.

Poisson data with unequal means

Example 3.13. The Poisson distribution is used to model counts of events that can be assumed to occur completely at random. For example, consider counts of photons arriving at a detector in different intervals of time.
Suppose that the rate of arrival is λ per unit time, and that X_i is the number of arrivals in a time interval of known length t_i (the lengths not necessarily equal).
It is natural to assume that X_i ∼ Poisson(λt_i), for i = 1, 2, ..., n. If the different intervals are not overlapping, then the counts X_i should be independent.

Example 3.13. The joint probability mass function of X_1, X_2, ..., X_n is
P{X_1 = x_1, X_2 = x_2, ..., X_n = x_n} = P{X_1 = x_1} × P{X_2 = x_2} × ··· × P{X_n = x_n}
= (e^(−λt_1)(λt_1)^(x_1)/x_1!) × (e^(−λt_2)(λt_2)^(x_2)/x_2!) × ··· × (e^(−λt_n)(λt_n)^(x_n)/x_n!)
= e^(−λ(t_1+t_2+···+t_n)) λ^(x_1+x_2+···+x_n) (t_1^(x_1) t_2^(x_2) ··· t_n^(x_n))/(x_1! x_2! ··· x_n!).
So the log-likelihood is just the logarithm of this:
ℓ(λ) = −λ(t_1 + t_2 + ··· + t_n) + (log λ)(x_1 + x_2 + ··· + x_n) + (terms not containing λ).

Poisson data with unequal means

Example 3.13.
Then
∂ℓ(λ)/∂λ = −(t_1 + t_2 + ··· + t_n) + (x_1 + x_2 + ··· + x_n)/λ.
So ∂ℓ(λ)/∂λ = 0 if and only if
λ = λ̂ = (x_1 + x_2 + ··· + x_n)/(t_1 + t_2 + ··· + t_n),
so this is the mle.
The turning point we have found is obviously a maximum, since we can see that ∂ℓ(λ)/∂λ is decreasing in λ.
Note that the mle λ̂_mle is the total count of photons divided by the total time of observation. Note that, when the t_i are unequal, this is not the same as the average of the individual estimates (x_i/t_i).

Section 4: Assessing the Performance of Estimators

Aims of this section:
In the previous sections we have seen several different ways of estimating a population parameter, or population quantity of interest, from a given set of sample data.
However, the sample data is just one of many possible samples that could be drawn from the population. Each sample would have different values, and so would give a different value for the estimate.
In this section we use simulation-based methods to investigate how the value of an estimate would vary as we took different independent random samples, and hence evaluate and compare the performance of different estimators.

Aims of this section:
Many estimators are based on the sum of the observations in a random sample from an underlying population distribution.
The exact distribution of these quantities may be difficult to compute, and varies with the underlying population distribution.
The Central Limit Theorem gives a simple way of approximating the distribution of the sum or mean, that depends only on the population mean and the population variance.
It also provides a plausible explanation for the fact that the distributions of many random variables studied in physical experiments are approximately Normal, in that their value may represent the overall addition of a number of random factors.
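The distinction at the end of Example 3.13 — the pooled estimate (total count over total time) versus the average of the per-interval rates x_i/t_i — is easy to see on made-up numbers:

```python
counts = [4, 7, 2]        # hypothetical photon counts x_i
times = [2.0, 5.0, 1.0]   # interval lengths t_i (deliberately unequal)

lam_mle = sum(counts) / sum(times)                                 # pooled mle
lam_avg = sum(x / t for x, t in zip(counts, times)) / len(counts)  # mean of x_i/t_i

print(lam_mle, lam_avg)   # the two differ when the t_i are unequal
```

The mle weights each interval by its length, so long intervals (with more information) count for more.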
Suggested reading: Ross, Sections 10.1-10.3; Rice, Sections 8.4, 8.5, 8.8; Rice, Section 5.3.

Objectives: by the end of this section you should be able to
- Generate random samples from a given standard distribution using the random number generator for that distribution in a statistical package such as R.
- Understand how the performance of an estimator can be related to systematic and random error through the bias and variance of the estimator.
- Evaluate the performance of an estimator for a single quantity of interest, both qualitatively from a boxplot or histogram of estimates from repeated samples, and quantitatively or numerically from summary statistics derived from the repeated samples.
- Recall the statement and the implications of the Central Limit Theorem.
- Apply the Central Limit Theorem to find the approximate distribution of the sum or mean of a random sample from a population distribution with known mean and variance.
- Apply a continuity correction to improve the approximation given by the Central Limit Theorem when the underlying variable is an integer-valued random variable representing counts.

Section 4.1: Different methods of Estimation

We have seen two general model-based methods for estimating a population quantity τ = τ(θ) (the method of moments and the method of maximum likelihood).
These estimate θ (by θ̂_mom and θ̂_mle respectively), and plug the estimates θ̂ into τ(θ) to give an estimate of τ.
There may also be a direct alternative, in which we simply use the relevant sample quantity to estimate the population value.

Example

Example 4.1.
Consider estimating the population median for a population which has a Uniform(0, θ) distribution, where θ is unknown, using a random sample with values x_1, ..., x_n.
The parametric methods use the fact that the population median for this distribution is θ/2.
- The method of moments estimates θ by θ̂_mom = 2x̄ and so estimates the population median by θ̂_mom/2 = x̄.
- The method of maximum likelihood estimates θ by θ̂_mle = x_(n), where x_(n) = max{x_1, ..., x_n} is the largest value in the sample, and thus estimates the population median by θ̂_mle/2 = x_(n)/2.
- The non-parametric method estimates the population median by the sample median (for n odd, this is x_((n+1)/2)).

Comparing estimates

For a given set of data, the three methods will result in three different estimates. The questions are:
- which estimate (or method of estimation) is best?
- how can we compare the methods when we don't actually know the value of the quantity we want to estimate?
If we do not know the value we are trying to estimate, we cannot usefully compare methods of estimation using only the resulting numerical estimates from a single sample.
The main way we compare methods of inference is to see how they perform under repeated sampling. We imagine future hypothetical samples of the same size from the same distribution, and examine how well each method performs in the long run.

Section 4.2: Repeated sampling, and sampling distributions

In probability language, we treat the sample as a collection of random variables X_1, X_2, ..., X_n, regard the estimators as functions of these random variables, and look at the distributions of these estimators.
We make a Key Definition, that motivates much of the rest of the course:

Definition 4.2. We refer to the distribution of an estimator θ̂ = θ̂(X_1, . . .
, Xn ) as a We refer to the distribution of an estimator θb = θ(X , as opposed to the original population distribution. Remark 4.3. A good estimator is one whose sampling distribution is concentrated close to the true value of the quantity it is trying to estimate. A poor estimate is one where the sampling distribution is either very spread out, or is concentrated around the wrong value. Oliver Johnson ([email protected]) Statistics 1: @BristOliver c TB 2 UoB 2017 97 / 279 Good estimators Definition 4.4. Let θb be an estimator of an unknown parameter θ. We define two key properties of its sampling distribution: b = E(θb − θ) = E(θ) b − θ. bias (θ) Say θ̂ is unbiased if bias(θ̂) = 0. (θ̂) = E[(θ̂ − θ)2 ] b = Var (θ) b + bias (θ) b 2. Can check that mse (θ) Example 4.5. In some rare cases, we can use methods from the to calculate the exact sampling distribution theoretically. course For example, if X1 , X2 , . . . , Xn ∼ N(µ, σ 2 ), we know that µ bmom = so that µ bmom = X ∼ N(µ, σ 2 /n) (see Theorem 5.9 below). , Hence bias (b µmom ) = E(b µmom ) − µ = µ − µ = 0, mse (b µmom ) = Var (b µmom ) + (bias (b µmom ))2 = σ 2 /n + 02 = σ 2 /n. Oliver Johnson ([email protected]) Statistics 1: @BristOliver c TB 2 UoB 2017 98 / 279 Section 4.3: Sampling distributions by simulation A more general, but empirical, approach is to use . In statistics, simulation is the process of artificially generating a as independent observations from a given probability distribution. Simulation-based procedures for evaluating a method of estimation replace the above idea of hypothetical future samples and probability calculations, with actual simulated numerical samples and numerical calculations. Thus, for a particular type of population distribution f (x; θ), we take particular values for the parameter θ and the sample size n. Then we generate a number (say ) of artificial data sets, each of which looks like a simple random sample of size n from f (x; θ). 
We calculate an estimate for each data set, giving a total of B different estimates.

Motivation

The idea is that this process represents, say, the experience of B independent statisticians each using the method. If B is large, the estimates generated by these independent experiments should give a good indication of the sampling distribution, and hence the overall performance of the method. Moreover, we can understand the strengths and weaknesses of different methods of estimation by comparing their overall performances on the same B data sets. In R we use the following procedure:
- Generate B × n numbers from f and arrange them in B groups of n.
- Calculate the estimates for each sample.
- Analyse the results numerically or graphically.

Section 4.4: Graphical summaries of performance

One way of exploring the performance of different methods is just to plot a histogram of the B estimates obtained in the simulation study.

Example 4.6.
Consider again the problem of estimating the population median for the Uniform(0, θ) distribution. The histograms below were constructed by simulating B = 1000 samples, each of size n = 10, from a Uniform(0, 1) distribution (so the true population median was θ/2 = 0.5). For each sample we compute the sample median, the method of moments estimate (x̄) and the maximum likelihood estimate (max{x1, . . . , xn}/2). We plot histograms of the 1000 estimates produced by each method.

Histograms of results

[Figure: density histograms of the 1000 estimates from each method – sample median, method of moments, maximum likelihood – each plotted on the range 0.0 to 0.8.]

For this example, the differences in the shape of the histograms are particularly striking.
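The downward shift of the maximum likelihood histogram can be checked against theory: for a Uniform(0, 1) population the maximum of n observations has E(X(n)) = n/(n + 1), so the mle of the median has mean n/(2(n + 1)) ≈ 0.455 when n = 10. A minimal R sketch of this check (the seed and the choice B = 10000 are illustrative, not part of Example 4.6):

```r
# Bias of the mle of the Uniform(0, 1) population median (true value 0.5).
# Theory: E(max of n Uniform(0, 1) values) = n/(n + 1), so E(mle) = n/(2(n + 1)).
n <- 10
theory.bias <- n / (2 * (n + 1)) - 0.5      # equals -1/22

set.seed(1)                                 # illustrative seed
B <- 10000                                  # illustrative number of samples
xsamples <- matrix(runif(B * n), nrow = B)
tau.mle <- apply(xsamples, 1, max) / 2
sim.bias <- mean(tau.mle) - 0.5             # should be close to theory.bias

theory.bias
sim.bias
```

The simulated bias should agree with the theoretical value −1/22 ≈ −0.045, matching the shift visible in the maximum likelihood histogram.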
Graphical summary of performance – boxplots

Example 4.6.
Boxplots are also convenient to compare visually the sampling distributions of different estimators. They use the median of each set of estimates as a measure of the centre of the distribution and the upper and lower quartiles as a measure of spread.

[Figure: boxplots of the three estimators of the population median, Unif(0, θ = 1) with n = 10 – sample median, MOM, MLE – plotted on the range 0.2 to 0.8.]

Example 4.6.
Clearly the estimates produced by the sample median and the method of moments are both centred on the true population median value of 0.5 but fairly widely spread about this value (sample median more than mom). The maximum likelihood estimates are centred on a value just below 0.5, and so the method slightly, but consistently, underestimates the true value. However the narrow spread of mle values means that for most samples the mle is closest to the true value.

Section 4.5: Numerical summaries of performance

Can sometimes derive explicit expressions for the bias and mean square error of an estimator (see Example 4.5). Usually we can only estimate these numerically (via simulation). Say we want to estimate a function τ(θ) whose true value is τ, and our simulation study produces B estimates τ̂1, . . . , τ̂B with sample mean τ̄ = Σᵢ τ̂ᵢ/B and sample variance Σᵢ (τ̂ᵢ − τ̄)²/(B − 1).
The average error (indicator of bias) is τ̄ − τ. To estimate mse, consider
  average squared error = Σᵢ (τ̂ᵢ − τ)² / B.
You can check that Σᵢ (τ̂ᵢ − τ)² = Σᵢ (τ̂ᵢ − τ̄)² + B(τ̄ − τ)², so that for B large,
  average squared error ≈ sample variance + (average error)².
(cf mse(θ̂) = Var(θ̂) + bias(θ̂)².)

Example 4.7.
For our Uniform distribution example, we estimate the population median τ (true value τ = 0.5) with B = 1000. The table confirms numerically the impression from the graphical summary – the size of the average error is larger for the mle than for the other two methods. The mle has a much smaller sample variance than the method of moments, which in turn has a smaller sample variance than the non-parametric method using the sample median. Overall the mle has the smallest average squared error, then the mom, then the estimate based on the sample median.

Methods of estimating the population median:
                                          sample median      mom        mle
  average error (estimates bias)             +0.00497     +0.00376   −0.04403
  sample variance (estimates variance)        0.01861      0.00798    0.00168
  average squared error (estimates mse)       0.01863      0.00800    0.00362

Section 4.6: Simulation using R

We illustrate this in the context of Example 4.6, estimating θ by θ̂mom = 2x̄, taking B = 1000 and n = 10.

Example 4.8.
xvalues <- runif(10000)
xsamples <- matrix(xvalues, nrow=1000)
sample.mean <- apply(xsamples, 1, mean)
theta.mom <- 2*sample.mean
par(mfrow=c(1,2))
hist(theta.mom, main = "Histogram of theta.mom")
boxplot(theta.mom, main= "Boxplot of theta.mom")
true.theta <- 1
mean(theta.mom - true.theta)
var(theta.mom)
mean( (theta.mom - true.theta)^2 )

Explaining R commands

Example 4.8.
xvalues <- runif(10000) generates 10000 values and assigns them to a vector called xvalues. (By default runif simulates from U(0, 1), but runif(10000, min=-1, max=2) would simulate from U(−1, 2).)
xsamples <- matrix(xvalues, nrow=1000) arranges this into a matrix with 1000 rows, each of length 10, each row corresponding to a random sample.
sample.mean <- apply(xsamples, 1, mean). The apply command applies the command mean to the set of values that share subscript 1 (i.e. to each row in turn).
This generates a vector sample.mean made up of the means of each sample. (Try apply(xsamples, 1, var) or apply(xsamples, 1, max) to generate the sample variances and maxima.)
par(mfrow=c(1,2)) puts the two plots next to each other.

Graphical output

Example 4.8.
The remaining code generates boxplots and histograms, and calculates the average error and average squared error.

[Figure: histogram and boxplot of theta.mom, side by side; the estimates range from roughly 0.4 to 1.6.]

Graphical output – comparison via boxplots

Example 4.8.
As in Example 4.6, we can compare the output of several estimators with boxplots generated with the following code:
xvalues <- runif(10000)
xsamples <- matrix(xvalues, nrow=1000)
sample.mean <- apply(xsamples, 1, mean)
sample.median <- apply(xsamples, 1, median)
sample.max <- apply(xsamples, 1, max)
tau.nonparam <- sample.median
tau.mom <- sample.mean
tau.mle <- sample.max/2
true.tau <- 0.5
boxplot(tau.nonparam, tau.mom, tau.mle, names = c("sample median","mom","mle"))
abline(h=true.tau, lty=2)

Graphical output – comparison via histograms

Example 4.8.
Here abline(h=true.tau, lty=2) plots a horizontal line at the true value of τ for comparison purposes, with lty=2 creating a dashed line. Similarly, we can generate multiple histograms using the following code, which fixes the range of x and y to make clear comparison easier.
par(mfrow = c(1,3))
hist(tau.nonparam, xlim=c(0,1), ylim=c(0,350))
hist(tau.mom, xlim=c(0,1), ylim=c(0,350))
hist(tau.mle, xlim=c(0,1), ylim=c(0,350))

Section 4.7: Approximate methods based on the Central Limit Theorem

One disadvantage of simulation-based methods is that each simulation only provides information about one particular situation and gives no direct information about what would happen for:
- other sample sizes n
- other values of the true parameter θ
- other types of population distribution f(x; θ)
- other methods of estimation
Also, the numerical accuracy of estimates of quantities like the bias is limited by the finite size of B, the number of samples. Therefore, as before we use probability theory to find sampling distributions whenever this is possible. For example when estimating µ using a simple random sample from N(µ, σ²) (see Example 4.5), we can calculate these key quantities analytically: the bias is 0, and the variance and mse are both σ²/n. In general, we cannot do this.

Approximations using the CLT

However, many estimators are based on the sum or mean of a random sample. The Central Limit Theorem lets us approximate the distribution of such estimators, whatever the distribution of the sample. The limiting distribution depends only on the mean and variance (not the actual distribution). The speed of convergence depends on X. For example, if X is symmetric with not too heavy tails, convergence is faster. The CLT is one of the most fundamental results in statistics – it can explain why many 'real world' data samples seem to be close to normal.

Central Limit Theorem

Theorem 4.9 (Central Limit Theorem).
Let X1, . . . , Xn be a random sample from a population with mean µ = E(X) and variance σ² = Var(X). Write X̄n = (X1 + . . . + Xn)/n for the sample mean. Then for n large, whatever the distribution of X, the normalized sample mean
  (X̄n − µ)/(σ/√n) ≈ N(0, 1).
That is, for Z ∼ N(0, 1) with distribution function Φ:
  P((X̄n − µ)/(σ/√n) ≤ x) ≈ P(N(0, 1) ≤ x) = Φ(x).
Equivalently (a) X̄n = (X1 + . . . + Xn)/n ≈ N(µ, σ²/n) and (b) (X1 + . . . + Xn) ≈ N(nµ, nσ²).
Proof. See Section 5.

Example 4.10.
Let X1, . . . , X10 be a random sample of size n = 10 from the Exp(2) distribution, which has mean µ = 1/2 and variance σ² = 1/4. The CLT tells us that
  (X̄n − µ)/(σ/√n) = ((X1 + . . . + X10)/10 − 1/2)/(1/(2√10)) ≈ N(0, 1).
Hence if we want to approximate P(X1 + . . . + X10 ≤ 5.2) then
  P(X1 + . . . + X10 ≤ 5.2) = P(X̄10 ≤ 5.2/10)
    = P((X̄10 − 1/2)/(1/(2√10)) ≤ (0.52 − 1/2)/(1/(2√10)))
    ≈ P(N(0, 1) ≤ 0.1265) ≈ 0.5503.
This is found using pnorm(0.1265).

Example 4.10.
In this case, X1 + . . . + X10 ∼ Γ(10, 2) (exactly), and pgamma(5.2,10,2) gives the exact value, showing the approximation is OK, but not amazing. For n = 1000, the probability P(X1 + . . . + Xn ≤ 502) is approximated by P(N(0, 1) ≤ 0.1265) ≈ 0.5503 again, and here the approximation is much closer to the exact gamma value pgamma(502,1000,2) (much better).

Section 4.8: Continuity correction

Consider X taking integer values, and consider T = X1 + . . . + Xn, where the Xi are IID with the same distribution as X. Theorem 4.9 says that P(T ≤ x) ≈ P(S ≤ x), where S ∼ N(nµ, nσ²). However, since T can only take integer values, it is better to approximate
  P(T = x) ≈ P(x − 1/2 ≤ S ≤ x + 1/2),   P(T ≤ x) ≈ P(S ≤ x + 1/2),
where the second result follows on summing the first. This is referred to as a continuity correction.

Example 4.11.
Let X1, . . . , X10 be IID Bernoulli(p), with p = 1/4. Then (see appendix) µ = p = 1/4, and σ² = p(1 − p) = 3/16. Consider T = X1 + . . . + X10. Theorem 4.9 suggests T ≈ S ∼ N(nµ, nσ²) = N(10/4, 30/16).
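The quality of this approximation can be checked directly in R by comparing it with the exact binomial probability; a short sketch (the variable names are mine):

```r
# Example 4.11 setup: T = X1 + ... + X10 with Xi IID Bernoulli(1/4),
# approximated by S ~ N(10/4, 30/16). Compare P(T <= 2) computed three ways.
n <- 10; p <- 0.25
mu <- n * p; sigma <- sqrt(n * p * (1 - p))

exact   <- pbinom(2, n, p)               # exact Bin(10, 1/4) probability
no.cc   <- pnorm((2   - mu) / sigma)     # CLT, no continuity correction
with.cc <- pnorm((2.5 - mu) / sigma)     # CLT, with continuity correction

c(exact, no.cc, with.cc)
```

The continuity-corrected value lies much closer to the exact probability, as the calculations in Example 4.11 confirm.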
Example 4.11.
Consider approximating P(T ≤ 2):
1. T ∼ Bin(10, 1/4) exactly, so using pbinom(2,10,0.25):
   P(T ≤ 2) = P(T = 0) + P(T = 1) + P(T = 2) ≈ 0.5256.
2. Without the continuity correction the approximation is inaccurate; using pnorm(-0.3651):
   P(T ≤ 2) ≈ P(S ≤ 2) = P((S − 10/4)/√(30/16) ≤ (2 − 10/4)/√(30/16)) = P(N(0, 1) ≤ −0.3651) ≈ 0.3575.
3. With the continuity correction we get a better result; using pnorm(0):
   P(T ≤ 2) ≈ P(S ≤ 2.5) = P((S − 10/4)/√(30/16) ≤ (2.5 − 10/4)/√(30/16)) = P(N(0, 1) ≤ 0) = 0.5.

Section 5: Sampling distributions related to the Normal distribution

Aims of this section: Knowing the exact distribution of an estimator helps us to understand how its behaviour depends, for example, on the sample size or the unknown population parameter values. It also enables us to incorporate our theoretical results into other aspects of our statistical analysis. In this section we derive the exact distribution of some sample statistics associated with random samples from a range of distributions, particularly focussing on the mean and variance of a random sample from the Normal distribution.

Suggested reading: Rice Section 2.3, Sections 6.1–6.3

Objectives: by the end of this section you should be able to
- Recall the distribution of the sample mean for a simple random sample from a Normal distribution and use statistical tables or an appropriate statistical package to compute relevant probabilities associated with its distribution.
- Recall the distribution of the sample variance for a simple random sample from a Normal distribution; understand that it is independent of the sample mean; and use statistical tables or an appropriate statistical package to compute relevant probabilities associated with its distribution.
- Identify the distribution of a sum or linear combination of independent Normally distributed random variables, and use statistical tables or an appropriate statistical package to compute relevant probabilities associated with its distribution.
- Identify the distribution of a sum of squares of independent random variables, each with the standard Normal distribution, and use statistical tables or an appropriate statistical package to compute relevant probabilities associated with its distribution.
- Identify the distribution of a sum of independent random variables, each with the same Exponential distribution, and use statistical tables or an appropriate statistical package to compute relevant probabilities associated with its distribution.
- Define the chi-square and t distributions, and look up percentile points of each distribution in tables or with R.
- Apply the results above to find the mean and variance of an estimator of a parameter or other population quantity of interest, and hence find its bias and mean square error.

Section 5.1: Revision of moment generating functions

Definition 5.1.
For a random variable X define the moment generating function (mgf) MX by
  MX(t) ≡ E(e^{tX}) = ∫ e^{tx} f(x) dx  (for continuous X)
  MX(t) ≡ E(e^{tX}) = Σₓ e^{tx} P(X = x)  (for discrete X).
MX is defined for whatever values of t the integral is well defined. MX uniquely determines the distribution: two random variables with the same mgf (assuming it is finite in an interval around the origin) have the same distribution.

Example 5.2.
X ∼ N(µ, σ²) ⇐⇒ MX(t) = exp{µt + σ²t²/2}  (t ∈ R)
X ∼ Exp(θ) ⇐⇒ MX(t) = θ/(θ − t)  (t < θ)
X ∼ Gamma(α, β) ⇐⇒ MX(t) = β^α/(β − t)^α  (t < β)

Lemma 5.3.
If Y = aX + b then MY(t) = E(e^{tY}) = E(e^{taX + tb}) = e^{tb} MX(ta).

Definition 5.4.
The joint mgf of X and Y is MX,Y(s, t) ≡ E(e^{sX + tY}).

Lemma 5.5.
1. The marginal moment generating functions for X and Y are given in terms of the joint moment generating function by
   MX(s) = E(e^{sX}) = MX,Y(s, 0),  MY(t) = E(e^{tY}) = MX,Y(0, t).
2. Two random variables X and Y are independent if and only if
   MX,Y(s, t) = MX(s)MY(t) = MX,Y(s, 0)MX,Y(0, t).

Independence

Lemma 5.6.
1. If X1, . . . , Xn are independent and Y = X1 + X2 + · · · + Xn, then e^{tY} = e^{tX1} e^{tX2} · · · e^{tXn}, so MY(t) = MX1(t)MX2(t) . . . MXn(t).
2. If X1, . . . , Xn are a random sample, i.e. they are all independent and all distributed as X, then this simplifies to MY(t) = [MX(t)]^n.
Proof. Observe MY(t) = E(e^{tY}) = E(e^{tX1} e^{tX2} · · · e^{tXn}) = E(e^{tX1})E(e^{tX2}) . . . E(e^{tXn}), using independence.

Section 5.2: Transforming, adding and sampling normals

Lemma 5.7.
Let X ∼ N(µ, σ²); then aX + b ∼ N(aµ + b, a²σ²).
Let X ∼ N(µ, σ²); then (X − µ)/σ ∼ N(0, 1).
Proof. We already know the means and variances; we are just checking the distribution. Example 5.2 shows that MX(t) = exp(µt + σ²t²/2). Lemma 5.3 shows that
  M_{aX+b}(t) = MX(at) exp(bt) = exp(µat + σ²a²t²/2) exp(bt) = exp((µa + b)t + a²σ²t²/2).
We recognise this as the mgf of a N(aµ + b, a²σ²), and the result follows by uniqueness of mgfs. The second part follows on taking a = 1/σ and b = −µ/σ.

Addition of normal distributions

Lemma 5.8.
If X1, . . . , Xn are independent with Xi ∼ N(µi, σi²) then for any weights ai, the sum Σᵢ aᵢXᵢ ∼ N(Σᵢ aᵢµᵢ, Σᵢ aᵢ²σᵢ²).
Proof. Combining Lemmas 5.6 and 5.3 we find that
  M_{Σᵢ aᵢXᵢ}(t) = Πᵢ M_{aᵢXᵢ}(t) = Πᵢ MXᵢ(aᵢt) = Πᵢ exp(µᵢaᵢt + σᵢ²aᵢ²t²/2)
    = exp[(Σᵢ aᵢµᵢ)t + (Σᵢ aᵢ²σᵢ²)t²/2].
Again the result follows by uniqueness of mgfs.

Sample mean for normal distribution

Theorem 5.9.
Let X1, . . . , Xn be a random sample of size n from N(µ, σ²) and let X̄ = (X1 + . . . + Xn)/n be the sample mean.
(i) X̄ ∼ N(µ, σ²/n).
(ii) √n (X̄ − µ)/σ ∼ N(0, 1).
Compare this with the Central Limit Theorem 4.9, which states that this result is approximately true for large n. If σ² is known, this result tells us how close we expect µ and X̄ to be. Hence, we can make inference about the unknown µ based on the known X̄. Much of the remainder of the chapter extends this to the more realistic case of σ² unknown, to prove Theorem 5.18.

Proof of Theorem 5.9(i)

Proof. Taking ai ≡ 1/n in Lemma 5.8, since µi ≡ µ and σi² ≡ σ², we deduce
  X̄ = Σᵢ (1/n) Xᵢ ∼ N(n · (1/n) µ, n · (1/n²) σ²) = N(µ, σ²/n).
The second result follows on noticing that
  √n (X̄ − µ)/σ = aX̄ + b, where a = √n/σ and b = −√n µ/σ.
Hence applying Lemma 5.7 we deduce that
  √n (X̄ − µ)/σ ∼ N(aµ + b, a²(σ²/n)) = N(0, 1).

Sketch Proof of Central Limit Theorem

Proof. We now prove Theorem 4.9, which states that if the Xi are IID and distributed like X with finite mean and variance, then
  Sn = (X̄n − µ)/(σ/√n) converges in distribution to Z ∼ N(0, 1).
We use a result (not proved here) that states that if MSn(t) → MZ(t) for all t, then P(Sn ≤ x) → P(Z ≤ x). This is the sense of the Central Limit Theorem that we claimed in Theorem 4.9. First recall from Lemma 5.6 that Tn = X1 + . . . + Xn has MTn(t) = [MX(t)]^n. Further, note that
  Sn = (√n/σ)(X1 + . . . + Xn)/n − µ√n/σ = Tn/(σ√n) − µ√n/σ.
As before, this is aTn + b, where a = 1/(σ√n) and b = −µ√n/σ.
Lemma 5.3 gives
  MSn(t) = exp(tb) MTn(at) = [exp(tb/n) MX(at)]^n.
We expand MX(u) in terms of Mi, the ith moment of X. Since M1 = µ and M2 = σ² + µ²,
  MX(u) = Σᵢ Mᵢ uⁱ/i! = 1 + µu + (σ² + µ²)u²/2 + . . . ,
so that
  MX(at) = 1 + µt/(σ√n) + (σ² + µ²)t²/(2σ²n) + . . . .

Hence we can simplify
  exp(tb/n) MX(at) = (1 − µt/(σ√n) + t²µ²/(2σ²n) + . . .)(1 + µt/(σ√n) + (σ² + µ²)t²/(2σ²n) + . . .)
    = 1 + t²/(2n) + o(1/n).
Hence, using the fact that lim_{n→∞} (1 + a/n)^n = exp(a):
  MSn(t) = (1 + t²/(2n) + . . .)^n → exp(t²/2) = MZ(t).
We recognise this as the mgf of a standard normal, and we are done.

Section 5.3: Independence of X̄ and Σⱼ (Xj − X̄)²

We now state a result which plays a fundamental role in underpinning the theory of many of the remaining parts of the course. The statement is very simple, but the proof is fairly technical. The result at first sight looks extremely surprising!

Theorem 5.10.
If X1, . . . , Xn is a random sample of size n from the N(µ, σ²) distribution then
  X̄ and Σⱼ (Xj − X̄)² are independent.

Sketch Proof of Theorem 5.10

The proof is made up of the following steps:
(i) the definition gives the joint moment generating function of X̄, X1 − X̄, X2 − X̄, . . . , Xn − X̄ as the expected value of a function of the variables;
(ii) simple manipulation reduces this complicated expression to a simple product of terms of the form exp(aj Xj);
(iii) since the Xj are independent, the expectation of this product is just the product of the expectations, each of which is MXj(aj), giving the joint mgf;
(iv) finally we observe that this joint mgf is the product of the marginal mgf for X̄ and the (marginal) joint mgf for X1 − X̄, X2 − X̄, . . . , Xn − X̄.
By analogy with Lemma 5.5.2, X̄ is independent of X1 − X̄, X2 − X̄, . . . , Xn − X̄, and hence of Σⱼ (Xj − X̄)².

Full Proof of Theorem 5.10: Part (i)

Proof. Let X̄ denote (X1 + · · · + Xn)/n. Let s̄ denote (s1 + s2 + · · · + sn)/n (so Σⱼ (sj − s̄) = 0). Let M(t, s1, . . . , sn) denote the joint moment generating function of the n + 1 random variables X̄, X1 − X̄, X2 − X̄, . . . , Xn − X̄. Then by definition
  M(t, s1, . . . , sn) = E(exp{tX̄ + s1(X1 − X̄) + s2(X2 − X̄) + · · · + sn(Xn − X̄)}).

Full Proof of Theorem 5.10: Part (ii)

Proof. Now rearranging the terms in the curly brackets gives
  tX̄ + s1(X1 − X̄) + · · · + sn(Xn − X̄) = (t − s1 − s2 − · · · − sn)X̄ + s1X1 + · · · + snXn = a1X1 + · · · + anXn.
Here
  aj = (t − Σᵢ sᵢ)/n + sj = t/n + (sj − s̄).
Hence
  M(t, s1, . . . , sn) = E(exp{tX̄ + s1(X1 − X̄) + · · · + sn(Xn − X̄)})
    = E(exp{a1X1 + · · · + anXn})
    = E(exp{a1X1} exp{a2X2} · · · exp{anXn}).

Full Proof of Theorem 5.10: Part (iii)

Proof. Since the Xj are independent, and using the normal mgf from Example 5.2,
  M(t, s1, . . . , sn) = E(exp{a1X1})E(exp{a2X2}) · · · E(exp{anXn})
    = MX1(a1)MX2(a2) · · · MXn(an)
    = exp(µa1 + σ²a1²/2) · · · exp(µan + σ²an²/2)
    = exp(µ Σⱼ aj + (σ²/2) Σⱼ aj²)
    = exp(µt + σ²t²/2n + (σ²/2) Σⱼ (sj − s̄)²).
The last equality follows from the facts that Σⱼ aj = t and Σⱼ aj² = t²/n + Σⱼ (sj − s̄)².

Full Proof of Theorem 5.10: Part (iv)

Proof. Hence
  M(t, 0, . . . , 0) = exp{µt + σ²t²/2n},
  M(0, s1, . . . , sn) = exp{(σ²/2) Σⱼ (sj − s̄)²},
giving
  M(t, s1, . . . , sn) = M(t, 0, . . . , 0) M(0, s1, . . . , sn).
Thus X̄ is independent of the random variables (X1 − X̄, X2 − X̄, . . . , Xn − X̄), and in particular X̄ is independent of Σⱼ (Xj − X̄)².

Section 5.4: The χ² distribution

Definition 5.11.
We say that a random variable W has the χ² distribution with r degrees of freedom, and write W ∼ χ²_r, if W has mgf
  MW(t) = (1 − 2t)^{−r/2} for t < 1/2.

Remark 5.12.
(i) Comparison with Example 5.2 shows that χ²_r ≡ Γ(r/2, 1/2), since they have the same mgf.
(ii) If W ∼ χ²_r then (see handout) EW = (r/2)/(1/2) = r and Var(W) = (r/2)/(1/2)² = 2r.

χ² are squared normals

Lemma 5.13.
If Z ∼ N(0, 1) then Y = Z² ∼ χ²_1.
Proof.
  MY(t) = E exp(tY) = ∫ exp(tz²) φ(z) dz = ∫ exp(tz²) (1/√(2π)) exp(−z²/2) dz
    = (1 − 2t)^{−1/2} ∫ (1/√(2π/(1 − 2t))) exp(−z²(1 − 2t)/2) dz
    = (1 − 2t)^{−1/2} · [1],
for t < 1/2. The last equation holds since we recognise the integrand as the density of a N(0, 1/(1 − 2t)), which integrates to 1. Since Y has the same mgf as a χ²_1, the result holds by uniqueness.

Sums of χ² are χ²

Lemma 5.14.
1. If U ∼ χ²_r and V ∼ χ²_s are independent then U + V ∼ χ²_{r+s}.
2. If the Zi are independent with Zi ∼ N(0, 1), then Σᵢ Zᵢ² ∼ χ²_n.
Proof. By definition MU(t) = (1 − 2t)^{−r/2} and MV(t) = (1 − 2t)^{−s/2}. Hence by Lemma 5.6,
  M_{U+V}(t) = MU(t)MV(t) = (1 − 2t)^{−r/2}(1 − 2t)^{−s/2} = (1 − 2t)^{−(r+s)/2}.
This is the mgf of χ²_{r+s}, and so the result follows by uniqueness. The final part follows by Lemma 5.13.

Section 5.5: Normal sampling distributions

Theorem 5.15.
Let X1, . . . , Xn be a random sample of size n from the N(µ, σ²) distribution. Then:
(i) Σⱼ (Xj − µ)²/σ² ∼ χ²_n.
(ii) Σⱼ (Xj − X̄)²/σ² ∼ χ²_{n−1}.
Proof: Part (i). Writing Yj = (Xj − µ)/σ, Lemma 5.7 gives Yj ∼ N(0, 1).
Further, the Yj are independent, so Σⱼ Yⱼ² ∼ χ²_n by Lemma 5.14. But Σⱼ (Xj − µ)²/σ² = Σⱼ Yⱼ², so we are done.

Proof of Theorem 5.15(ii)

Proof: Part (ii). Set W3 ≡ Σⱼ Yⱼ², W2 ≡ Σⱼ (Yj − Ȳ)² and W1 ≡ nȲ². Note that using Σⱼ Yj = nȲ we can write
  W3 = Σⱼ Yⱼ² = Σⱼ (Yj − Ȳ)² + nȲ² = W2 + W1.
(To see this, try expanding Σⱼ (Yj − Ȳ)².) Further, W1 and W2 are independent, from Theorem 5.10. Thus MW3(t) = M_{W1+W2}(t) = MW1(t)MW2(t), or equivalently, MW2(t) = MW3(t)/MW1(t). But W1 ∼ χ²_1 from Lemma 5.13, as √n Ȳ ∼ N(0, 1) from Theorem 5.9, since here µ = 0 and σ² = 1. Similarly, W3 ∼ χ²_n, using Theorem 5.15(i). Hence
  MW2(t) = (1 − 2t)^{−n/2}/(1 − 2t)^{−1/2} = (1 − 2t)^{−(n−1)/2}.
This is the mgf for χ²_{n−1}, hence W2 = Σⱼ (Xj − X̄)²/σ² ∼ χ²_{n−1}.

Section 5.6: The t distribution

Definition 5.16.
Let U and V be independent random variables with U ∼ N(0, 1) and V ∼ χ²_r, and let
  W = U/√(V/r).
We say that W has the t distribution with r degrees of freedom, and write W ∼ t_r.

Remark 5.17.
It is vital that U and V be independent. It's why we needed to go to the effort of proving Theorem 5.10. If W ∼ t_r, the density of W is symmetric about 0. It is similar to N(0, 1) but with heavier tails. The density approaches that of N(0, 1) as r → ∞. W has EW = 0 (for r > 1) and Var(W) = r/(r − 2) (for r > 2).

The most important slide of the whole course?

Theorem 5.18.
Let X1, . . . , Xn be a random sample of size n from the N(µ, σ²) distribution. Then, writing X̄ = (X1 + . . . + Xn)/n and S² = Σⱼ (Xj − X̄)²/(n − 1):
1. U := √n (X̄ − µ)/σ ∼ N(0, 1)
2. V := Σⱼ (Xj − X̄)²/σ² ∼ χ²_{n−1}
3. U and V are independent.
4. √n (X̄ − µ)/S ∼ t_{n−1}.
This result allows us to know how far apart we expect µ and X̄ to be, even when σ² is unknown. This makes it the counterpart of Theorem 5.9.
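Theorem 5.18 can be illustrated by simulation: for normal samples the statistic √n(X̄ − µ)/S should behave like t_{n−1}, with heavier tails than N(0, 1). A sketch in R (the seed, n, B and the parameter values are illustrative choices of mine):

```r
# Simulate B normal samples of size n and form sqrt(n) * (Xbar - mu) / S,
# which Theorem 5.18 says has the t distribution with n - 1 degrees of freedom.
set.seed(2)
n <- 5; mu <- 3; sigma <- 2; B <- 20000
xsamples <- matrix(rnorm(B * n, mean = mu, sd = sigma), nrow = B)
xbar <- apply(xsamples, 1, mean)
s    <- apply(xsamples, 1, sd)
t.stat <- sqrt(n) * (xbar - mu) / s

# Roughly 5% of the statistics exceed the t_{n-1} critical value ...
mean(abs(t.stat) > qt(0.975, df = n - 1))
# ... but clearly more than 5% exceed the N(0, 1) value 1.96,
# reflecting the heavier tails of the t distribution.
mean(abs(t.stat) > qnorm(0.975))
```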
Proof of Theorem 5.18

Proof. Most of these facts are already known:
- U = √n (X̄ − µ)/σ ∼ N(0, 1) by Theorem 5.9.
- V = Σⱼ (Xj − X̄)²/σ² ∼ χ²_{n−1} by Theorem 5.15(ii), so S²/σ² = V/(n − 1) ∼ χ²_{n−1}/(n − 1).
- U and V are independent, by Theorem 5.10.
The result is proved since
  √n (X̄ − µ)/S = (√n (X̄ − µ)/σ) · (1/√(S²/σ²)) ∼ N(0, 1)/√(χ²_{n−1}/(n − 1)),
where the two terms are independent as required, so the ratio has the t_{n−1} distribution by Definition 5.16.

Section 5.7: Percentage points of distributions

In applications, we need to know values xα such that P(X ≥ xα) = α. Using this:
  P(X ≤ x_{1−α}) = 1 − P(X ≥ x_{1−α}) = 1 − (1 − α) = α.
Often we need these for α = 0.1, 0.05, 0.025, 0.01 etc. Traditionally they were given in tables, but now they are more commonly calculated with R:

  RV           Notation                Symmetry?                    R command
  Z ∼ N(0, 1)  P(Z ≥ zα) = α           Yes: z_{1−α} = −zα           qnorm(1 − a)
  T ∼ t_r      P(T ≥ t_{r;α}) = α      Yes: t_{r;1−α} = −t_{r;α}    qt(1 − a, r)
  W ∼ χ²_r     P(W ≥ χ²_{r;α}) = α     No: χ²_{r;1−α} ≠ −χ²_{r;α}   qchisq(1 − a, r)

Remark 5.19.
Using this notation, we deduce (draw a picture?):
  (1 − α) = P(−z_{α/2} ≤ Z ≤ z_{α/2})
  (1 − α) = P(−t_{r;α/2} ≤ T ≤ t_{r;α/2})
  (1 − α) = P(χ²_{r;1−α/2} ≤ W ≤ χ²_{r;α/2})

Section 5.8: Similar theory for Γ distributions

We can give some similar results giving exact distributions for sums and means of exponentials. The following result is the exponential analogue of Theorem 5.9. It can be proved in a similar way using facts about mgfs from Example 5.2.

Lemma 5.20.
Let X1, . . . , Xn be a random sample of size n from the Exp(θ) distribution. Then:
- Σⱼ Xj ∼ Γ(n, θ).
- X̄ = (Σⱼ Xj)/n ∼ Γ(n, nθ).
- 2θ Σⱼ Xj ∼ Γ(n, 1/2) = χ²_{2n}.

Section 5.9: The F distribution

Definition 5.21.
Let U ∼ χ²_r and V ∼ χ²_s independently, and let
  W = (U/r)/(V/s).
Then W has the F distribution with r and s degrees of freedom, and we write W ∼ F_{r,s}. Define the percentage point F_{r,s;α} as the value such that P(W ≥ F_{r,s;α}) = α when W ∼ F_{r,s}.
The density function is heavily skewed with a long tail. If W ∼ F_{r,s} then by definition 1/W ∼ F_{s,r}, so F_{r,s;1−α} = 1/F_{s,r;α}. This distribution forms a starting point for statistical inference about equality of variances for two normal populations, linear regression and analysis of variance (see Linear Models).

Applications of the F distribution

Theorem 5.22.
Let X1, . . . , Xm be a random sample of size m from N(µX, σX²). Independently, let Y1, . . . , Yn be a random sample of size n from N(µY, σY²). When σX² = σY² (= σ², say),
  T = (Σᵢ (Xi − X̄)²/(m − 1)) / (Σⱼ (Yj − Ȳ)²/(n − 1)) ∼ F_{m−1,n−1}.
Proof. From Theorem 5.15(ii), independently
  σ̂X² = Σᵢ (Xi − X̄)²/(m − 1) ∼ σX² χ²_{m−1}/(m − 1) and σ̂Y² = Σⱼ (Yj − Ȳ)²/(n − 1) ∼ σY² χ²_{n−1}/(n − 1).
When σX² = σY², T is therefore a ratio of independent χ² random variables, each divided by its degrees of freedom, so T ∼ F_{m−1,n−1} by Definition 5.21. The distribution of T is independent of the unknown parameters µX, µY and σ².

Section 6: Confidence intervals

Aims of this section: In previous sections we have seen how to use the observations in a simple random sample from a given population to estimate a population parameter or some other population quantity of interest. However, we have also seen that different samples would give different estimates, so our estimate cannot be 'exactly' correct. In this section we derive procedures for reporting the accuracy of our estimate by constructing a confidence interval – an interval of values around the estimate which has a pre-set level of probability of containing the true value of the parameter or other quantity being estimated.
Suggested reading: Rice Sections 7.3.3, 8.5.3, 10.4.5

Objectives: by the end of this section you should be able to
- Construct an exact confidence interval for the population mean, with a given confidence level, based on a simple random sample from a Normal distribution.
- Construct an exact confidence interval for the population variance, with a given confidence level, based on a simple random sample from a Normal distribution.
- Recall and explain the assumptions under which the standard formulae for confidence intervals are applicable, and be aware of how the validity of these assumptions might be explored using Exploratory Data Analysis.
- Explain how the length of a confidence interval for a population mean depends qualitatively on the required confidence level and on the size of the simple random sample.
- Construct an approximate confidence interval for a population mean, with a given confidence level, based on the mean of a simple random sample from the underlying population distribution.
- Construct an approximate confidence interval for a proportion, with a given confidence level, based on the mean of a simple random sample from a Bernoulli distribution.

Section 6.1: Introduction

Example 6.1.
Consider the simple case of a random sample of size n from a N(µ, σ²) distribution. Suppose the population mean µ is an unknown parameter which we wish to estimate and (unrealistically) σ² is known (say σ² = σ0²). The natural estimator is µ̂mom = µ̂mle = X̄. Recall from Section 4 that any estimator is random (depends on the data), with a particular sampling distribution.
Hence we need to report the value of the estimate, together with some measure of its accuracy or margin of error.
For example, we could give an interval (centred on the estimate) that we are 95% confident contains the true value of µ.
Knowing the sampling distribution allows us to calculate this.

Definition
Definition 6.2. Take 0 < α < 1. A 100(1 − α)% confidence interval for a parameter θ is an interval of the form (cL, cU) (here L is for lower limit, U for upper limit) such that:
The parameter lies in the interval with probability (1 − α):
P(cL ≤ θ ≤ cU) = 1 − α.    (6.1)
cL and cU are calculated only using the value of n, the sample data (x1, ..., xn), and any known parameters.
Remark 6.3. It is very important to understand that in Equation (6.1), θ is a fixed constant, not random. cL and cU depend on the data (are random), so vary from sample to sample. (6.1) is an assertion about the joint distribution of cL and cU.

Procedure for finding the interval
Remark 6.4. Our procedure for finding the interval is based on some function f(X) of the data. It usually effectively comes in two stages:
1. We treat θ as fixed but unknown. We use our collection of facts about sampling distributions to find an interval depending on θ such that
P(g1(θ) ≤ f(X) ≤ g2(θ)) = 1 − α.
2. Then we 'invert' the interval, to rewrite the same event as a statement about θ:
P(cL(f(X)) ≤ θ ≤ cU(f(X))) = 1 − α.

Section 6.2: N(µ, σ²): Confidence Interval for µ; σ² known
Example 6.5. Return to the setting of Example 6.1 (normal sample, known σ²).
From Theorem 5.9: X̄ ∼ N(µ, σ0²/n) and (X̄ − µ)/(σ0/√n) ∼ N(0, 1).
Since (using qnorm(0.975)) we know z_{0.025} = 1.96, then (see Remark 5.19)
P(−1.96 ≤ (X̄ − µ)/(σ0/√n) ≤ 1.96) = 0.95.
But (X̄ − µ)/(σ0/√n) ≤ 1.96 ⇐⇒ X̄ − µ ≤ 1.96 σ0/√n ⇐⇒ X̄ − 1.96 σ0/√n ≤ µ.
And −1.96 ≤ (X̄ − µ)/(σ0/√n) ⇐⇒ −1.96 σ0/√n ≤ X̄ − µ ⇐⇒ µ ≤ X̄ + 1.96 σ0/√n.
So the event
{−1.96 ≤ (X̄ − µ)/(σ0/√n) ≤ 1.96} ⇐⇒ {X̄ − 1.96 σ0/√n ≤ µ ≤ X̄ + 1.96 σ0/√n}.

Example 6.5. Thus we report a 95% confidence interval with
cL = X̄ − 1.96 σ0/√n and cU = X̄ + 1.96 σ0/√n.
Suppose we take a large number of simple random samples from the distribution, each of fixed size n.
In roughly 95% of the samples the interval X̄ ± 1.96 σ0/√n will contain the true parameter value µ (and in 5% it will not).

General 100(1 − α)% confidence interval
Example 6.6. More generally, we find a 100(1 − α)% confidence interval. Remark 5.19 gives
P(−z_{α/2} ≤ (X̄ − µ)/(σ0/√n) ≤ z_{α/2}) = 1 − α.
Rearranging in the same way, the event
{−z_{α/2} ≤ (X̄ − µ)/(σ0/√n) ≤ z_{α/2}} ⇐⇒ {X̄ − z_{α/2} σ0/√n ≤ µ ≤ X̄ + z_{α/2} σ0/√n}.
We can therefore report a 100(1 − α)% confidence interval with
cL = X̄ − z_{α/2} σ0/√n and cU = X̄ + z_{α/2} σ0/√n.
In about 100(1 − α)% of the samples the interval X̄ ± z_{α/2} σ0/√n will contain the true parameter value µ (and in 100α% it will not).

Remark 6.7. In ∼ 100(1 − α)% of cases the interval will contain the true value of µ, and in the remainder it will not. It is impossible to tell for each sample whether the interval does or does not contain µ.
The interval is of the form
(cL, cU) = (X̄ − z_{α/2} σ0/√n, X̄ + z_{α/2} σ0/√n),
with end-points which depend on the data as well as on the value of n and the value of the known parameter σ0.
The length of the confidence interval is 2 z_{α/2} σ0/√n. All things being equal this:
1. DECREASES as the sample size n INCREASES
2. INCREASES as the population variance σ0² INCREASES
3. INCREASES as the confidence level 100(1 − α) INCREASES (since this means α DECREASES and so z_{α/2} INCREASES).
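The known-variance interval of Example 6.6 can be sketched in a few lines. The notes use R's qnorm(1 - alpha/2); the Python/scipy version below (not part of the original notes) plays the same role, and the numbers x̄ = 10, σ0 = 2, n = 25 are purely illustrative.

```python
# Sketch, not from the notes: the z-interval Xbar +/- z_{alpha/2} sigma0/sqrt(n)
# of Example 6.6. norm.ppf is the scipy analogue of R's qnorm.
from math import sqrt
from scipy.stats import norm

def z_interval(xbar, sigma0, n, alpha=0.05):
    """100(1-alpha)% CI for mu when sigma0 is known."""
    z = norm.ppf(1 - alpha / 2)       # z_{alpha/2}, e.g. 1.96 for alpha = 0.05
    half = z * sigma0 / sqrt(n)       # half-length of the interval
    return xbar - half, xbar + half

# Illustrative numbers (not a dataset from the course):
cl, cu = z_interval(xbar=10.0, sigma0=2.0, n=25, alpha=0.05)
print(cl, cu)
```

Doubling the confidence level's α or quadrupling n halves the half-length, matching points 1. and 3. of Remark 6.7.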
Section 6.3: N(µ, σ²): CI for µ; σ² unknown
Example 6.8. A more realistic setting than Example 6.1 is the following:
Assume X1, ..., Xn is a random sample from N(µ, σ²) with σ² unknown. We can't apply Theorem 5.9. However, Theorem 5.18 implies that
(X̄ − µ)/(S/√n) ∼ t_{n−1},
where X̄ = (X1 + ... + Xn)/n and S² = Σ_{j=1}^n (Xj − X̄)²/(n−1).
Write t_{n−1;α/2} for the value such that P(T ≥ t_{n−1;α/2}) = α/2, where T ∼ t_{n−1}. By symmetry P(T ≤ −t_{n−1;α/2}) = α/2.
Can find t_{n−1;α/2} using qt(1-alpha/2, n-1).

Section 6.3: N(µ, σ²): CI for µ; σ² unknown
Example 6.8. By definition (see Theorem 5.18 and Remark 5.19):
P(−t_{n−1;α/2} ≤ (X̄ − µ)/(S/√n) ≤ t_{n−1;α/2}) = 1 − α.
Now (X̄ − µ)/(S/√n) ≤ t_{n−1;α/2} ⇐⇒ X̄ − S t_{n−1;α/2}/√n ≤ µ.
Similarly −t_{n−1;α/2} ≤ (X̄ − µ)/(S/√n) ⇐⇒ µ ≤ X̄ + S t_{n−1;α/2}/√n.
Hence P(X̄ − S t_{n−1;α/2}/√n ≤ µ ≤ X̄ + S t_{n−1;α/2}/√n) = 1 − α.
Equivalently cL = X̄ − S t_{n−1;α/2}/√n and cU = X̄ + S t_{n−1;α/2}/√n define a 100(1 − α)% confidence interval for µ.
The interval has (random) length 2 S t_{n−1;α/2}/√n.
Points 1. and 3. of Remark 6.7 also apply here (larger sample size gives smaller interval, larger confidence 100(1 − α) gives larger interval).
As before, the distribution of (X̄ − µ)/(S/√n) does not depend on the unknown parameters.

Example: Newcomb data
Example 6.9. Return to the setting of Example 1.19; speed of light data, outliers removed (x6 = −44 and x10 = −2), leaving n = 64 data points.
Nothing in the histogram to contradict the model that the data is a simple random sample from N(µ, σ²) with µ and σ² unknown.
Here Σ_{i=1}^n xi = 1776, Σ_{i=1}^n xi² = 50912, hence x̄ = 27.75 and s² = (Σ_{i=1}^n xi² − n x̄²)/(n − 1) = 25.84127, so s = 5.0834.
For a 95% confidence interval, can find t_{63;0.025} = 1.998341 using qt(0.975,63) in R.
Substituting into Example 6.8, a 95% confidence interval for µ:
(cL, cU) = (X̄ − S t_{n−1;α/2}/√n, X̄ + S t_{n−1;α/2}/√n)
= (27.75 − 5.0834 × 1.998/√64, 27.75 + 5.0834 × 1.998/√64)
= (26.48, 29.02).

Section 6.4: Confidence interval for σ² for N(µ, σ²) data
Lemma 6.10. Assume X1, ..., Xn is a random sample from N(µ, σ²) with σ² unknown. The sampling-distribution theorem of Section 5 gives
Σ_{j=1}^n (Xj − X̄)²/σ² ∼ χ²_{n−1}.
Hence for any given α, Remark 5.19 gives
P(χ²_{n−1;1−α/2} ≤ Σ_{j=1}^n (Xj − X̄)²/σ² ≤ χ²_{n−1;α/2}) = 1 − α.
Hence, inverting, we obtain a 100(1 − α)% confidence interval for σ² taking
(cL, cU) = (Σ_{j=1}^n (Xj − X̄)²/χ²_{n−1;α/2}, Σ_{j=1}^n (Xj − X̄)²/χ²_{n−1;1−α/2}).

Example: Newcomb data
Example 6.11. Return to the setting of Example 1.19.
Here n = 64, Σ_{i=1}^n xi = 1776, Σ_{i=1}^n xi² = 50912.
Hence Σ_{i=1}^n (xi − x̄)² = Σ_{i=1}^n xi² − n x̄² = 1628.
R (using the qchisq(., 63) command) gives χ²_{63;0.975} = 42.95 and χ²_{63;0.025} = 86.83.
Lemma 6.10 shows that
(cL, cU) = (Σ_{j=1}^n (Xj − X̄)²/χ²_{n−1;α/2}, Σ_{j=1}^n (Xj − X̄)²/χ²_{n−1;1−α/2})
gives a 100(1 − α)% confidence interval for σ².
Substituting in, we obtain (cL, cU) = (1628/86.83, 1628/42.95) = (18.75, 37.90).

Section 6.5: Confidence interval for θ in U(0, θ) population
Lemma 6.12. Consider X1, ..., Xn a simple random sample from U(0, θ).
Recall that θ̂mle = X(n) = max(X1, ..., Xn), so
P(X(n) ≤ x) = P(X1 ≤ x, X2 ≤ x, ..., Xn ≤ x) = (x/θ)^n if 0 ≤ x ≤ θ.
Hence, if we take x = θ u^{1/n}, for 0 ≤ u ≤ 1, then
P(X(n)/θ ≤ u^{1/n}) = P(X(n) ≤ θ u^{1/n}) = u.
If we choose u1 = α/2 and u2 = 1 − α/2, the usual "inversion" gives
1 − α = u2 − u1 = P(u1^{1/n} ≤ X(n)/θ ≤ u2^{1/n}) = P(X(n)/u2^{1/n} ≤ θ ≤ X(n)/u1^{1/n}).
Hence we can take (cL, cU) = (X(n) (1 − α/2)^{−1/n}, X(n) (α/2)^{−1/n}).
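Both Newcomb intervals can be reproduced from the summary statistics alone. The notes use R's qt(0.975, 63) and qchisq(., 63); the Python/scipy sketch below (not part of the original notes) uses t.ppf and chi2.ppf in their place.

```python
# Sketch, not from the notes: the two Newcomb intervals of Examples 6.9
# and 6.11 from the summary statistics n, sum x_i, sum x_i^2.
from math import sqrt
from scipy.stats import t, chi2

n, sum_x, sum_x2 = 64, 1776, 50912
xbar = sum_x / n                          # 27.75
ss = sum_x2 - n * xbar**2                 # sum of (x_i - xbar)^2 = 1628
s = sqrt(ss / (n - 1))                    # 5.0834...

# 95% CI for mu: xbar +/- s * t_{63;0.025} / sqrt(n)   (Example 6.9)
tq = t.ppf(0.975, n - 1)                  # t_{63;0.025} = 1.998341...
mu_cl, mu_cu = xbar - s * tq / sqrt(n), xbar + s * tq / sqrt(n)

# 95% CI for sigma^2: (ss/chi2_{63;0.025}, ss/chi2_{63;0.975})   (Example 6.11)
var_cl = ss / chi2.ppf(0.975, n - 1)      # 1628 / 86.83
var_cu = ss / chi2.ppf(0.025, n - 1)      # 1628 / 42.95

print((round(mu_cl, 2), round(mu_cu, 2)), (round(var_cl, 2), round(var_cu, 2)))
```

Note how the interval for µ is symmetric about x̄, while the interval for σ² is not symmetric about s², reflecting the skewness of the χ² distribution.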
Section 6.6: Confidence interval for θ in Exp(θ) population
Lemma 6.13. Take a simple random sample X1, ..., Xn from an Exp(θ) population. We want to construct a 100(1 − α)% confidence interval for the parameter θ.
The standard estimates for θ are θ̂mle = θ̂mom = n/Σ_{j=1}^n Xj.
From Lemma 5.20 we know 2θ Σ_{j=1}^n Xj ∼ χ²_{2n}.
To construct an 'equal tailed' confidence interval we note that
1 − α = P(χ²_{2n;1−α/2} ≤ 2θ Σ_{j=1}^n Xj ≤ χ²_{2n;α/2})
= P(χ²_{2n;1−α/2}/(2 Σ_{j=1}^n Xj) ≤ θ ≤ χ²_{2n;α/2}/(2 Σ_{j=1}^n Xj)).
Thus, a 100(1 − α)% confidence interval for θ is given by
cL = χ²_{2n;1−α/2}/(2 Σ_{j=1}^n Xj) and cU = χ²_{2n;α/2}/(2 Σ_{j=1}^n Xj).

Example - Earthquakes - 90% Confidence Interval
Example 6.14. Consider again the quakes data set (Example 1.17) with n = 62 observations.
From graphical plots in Example 1.17 and assessment of fit in Section 2.7, it is reasonable to assume the data comes from the Exp(θ) family.
For this dataset, n = 62, Σ_{j=1}^n xj = 27107, so that θ̂ = 1/x̄ = 0.002287.
Lemma 6.13 gives that
(cL, cU) = (χ²_{2n;1−α/2}/(2 Σ_{j=1}^n Xj), χ²_{2n;α/2}/(2 Σ_{j=1}^n Xj)).

Example - Earthquakes - 90% Confidence Interval
Example 6.14. We illustrate the effect of different confidence levels in the following table. Increasing 1 − α leads to a wider interval, as you might expect. The values of χ²_{124;1−α/2} and χ²_{124;α/2} are obtained using the command qchisq(.,124).

100(1 − α)   χ²_{124;1−α/2}   χ²_{124;α/2}   100 × cL   100 × cU   100 × length
90%          99.28            150.99         0.1831     0.2785     0.0954
95%          95.07            156.71         0.1754     0.2891     0.1137
99%          87.19            168.31         0.1608     0.3105     0.1496

Section 6.7: Confidence intervals by simulation in R
Sometimes we do not have the distributional facts required to construct an exact confidence interval.
Suppose we have a simple random sample X1, ..., Xn of size n from a distribution in a parametric family with a density f(x; θ) for a single unknown parameter θ.
We can construct an approximate confidence interval by simulation as follows:
1. Calculate an estimate θ̂ for θ.
2. Simulate B simple random samples, each of the same size n as the original sample, from f(x; θ̂).
3. Calculate the B estimates θ1∗, ..., θB∗, one from each simulated sample, using the same estimation method as in step 1 above.

Confidence intervals by simulation in R
4. Calculate the values of θk∗ − θ̂. If θ̂ is close to θ, then the distribution of the values of θ∗ − θ̂ for samples from the distribution with parameter θ̂ will be close to the distribution of θ̂ − θ for samples from the distribution with parameter θ.
5. Identify values kL and kU such that Bα/2 of the values of θk∗ − θ̂ are < kL and Bα/2 are > kU. Then step 4 gives:
P(kL ≤ θ̂ − θ ≤ kU) ≃ P(kL ≤ θ∗ − θ̂ ≤ kU) ≃ 1 − α.
6. The event {kL ≤ θ̂ − θ ≤ kU} is equivalent to the event {θ̂ − kU ≤ θ ≤ θ̂ − kL}, so for B large the interval (cL, cU) is an approximate 100(1 − α)% confidence interval for θ, where
cL = θ̂ − kU and cU = θ̂ − kL.

Example - Earthquakes - 90% Confidence Interval
Example 6.15. Consider again the quakes data set (Examples 1.17 and 6.14). Apply the following R commands.
> theta.hat <- 1/mean(quakes)
> xsamples <- matrix(rexp(62000,theta.hat), nrow=1000)
> xmean <- apply(xsamples,1,mean)
> theta.star <- 1/xmean
> diff.theta <- (theta.star - theta.hat)
> sort.diff <- sort(diff.theta)
> sort.diff[c(50,950)]
> cl <- theta.hat - sort.diff[950]
> cu <- theta.hat - sort.diff[50]
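Steps 1-6 can be translated out of R almost line for line. The Python sketch below (not part of the original notes) repeats the parametric bootstrap for the earthquake data using only the summary total Σxj = 27107 from Example 6.14, with B = 1000 and a fixed random seed as illustrative choices, and compares the result with the exact χ²_{124} interval.

```python
# Sketch, not from the notes: the simulation interval of steps 1-6,
# mirroring the slide's R commands (rexp / apply / sort), plus the exact
# chi-square interval of Lemma 6.13 for comparison.
import random
from scipy.stats import chi2

random.seed(1)                              # illustrative seed
n, B, alpha = 62, 1000, 0.10
total = 27107                               # sum of the quakes data
theta_hat = n / total                       # step 1: 1/mean = 0.002287...

# Steps 2-3: B samples of size n from Exp(theta_hat), re-estimate each time
theta_star = []
for _ in range(B):
    sample = [random.expovariate(theta_hat) for _ in range(n)]
    theta_star.append(n / sum(sample))

# Steps 4-6: order the differences, pick the 50th and 950th, then invert
diffs = sorted(ts - theta_hat for ts in theta_star)
k_lower = diffs[int(B * alpha / 2) - 1]     # as in sort.diff[50]
k_upper = diffs[int(B * (1 - alpha / 2)) - 1]   # as in sort.diff[950]
cl, cu = theta_hat - k_upper, theta_hat - k_lower

# Exact equal-tailed interval from Lemma 6.13, via chi2.ppf (R: qchisq)
exact_cl = chi2.ppf(alpha / 2, 2 * n) / (2 * total)
exact_cu = chi2.ppf(1 - alpha / 2, 2 * n) / (2 * total)
print((cl, cu), (exact_cl, exact_cu))       # the two intervals are close
```

As in Example 6.15 below, the simulated end points should land close to the exact ones (0.001831, 0.002785), with small run-to-run variation controlled by B.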
First calculate θ̂ for the quakes data; then generate 62,000 observations from the Exp(θ̂) distribution, and arrange them into a matrix of 1000 samples each with n = 62 observations.
Next, xmean calculates a vector of 1000 sample means and hence a vector of estimates θk∗ for the 1000 samples; diff.theta calculates the vector of differences θk∗ − θ̂; sort sorts these differences in order of increasing value and puts the sorted values in the vector sort.diff.
Finally, we want a 90% confidence interval, so α/2 = 0.05. Thus the last three commands output the 50th and the 950th of the 1000 ordered values of θk∗ − θ̂ (i.e. the 0.05 and 0.95 empirical quantiles of the differences); calculate cL = θ̂ − kU; and calculate cU = θ̂ − kL.

Example 6.15. A histogram of the 1000 differences θk∗ − θ̂ for a particular simulation is given below.
Recall that the estimate of θ here is θ̂ = 0.00229.
For this simulation the 0.05 and 0.95 quantiles were kL = −0.00039 and kU = 0.00058 respectively, so the confidence interval calculated from the simulation had end points cL = θ̂ − kU = 0.00171 and cU = θ̂ − kL = 0.00268.
This compares well with the exact 90% confidence interval, which has end points (Example 6.14) cL = 0.001831 and cU = 0.002785, calculated using R.

[Figure: "Histogram of diff.theta" - frequency (0 to ~250) of the differences θk∗ − θ̂, which range from about −0.0005 to 0.0015.]

Section 7: Hypothesis Tests
Aims of this section:
A hypothesis test is a procedure for evaluating whether sample data is consistent with one of two contrasting statements about the value taken by one (or more) population parameters.
We will focus on the case when the data is in the form of a simple random sample from a single Normal population and the parameter of interest is the population mean.
Suggested reading: Rice, Sections 9.1–9.5.

Objectives: by the end of this section you should be able to
Recall the definition of the following terms: null hypothesis, alternative hypothesis, p-value, significance level, critical region, type I and type II error, power.
Perform standard hypothesis tests on the value of the population mean, based on a simple random sample from a Normal distribution with either known or unknown variance.
Starting with an informal problem description, formulate appropriate statements of any model assumptions and of the null and alternative hypotheses of interest.
In standard cases, identify an appropriate test statistic and state its distribution under the null hypothesis.
For each of the standard types of alternative hypothesis, identify the set of values of the test statistic that are at least as extreme as a given observed value.

Objectives: by the end of this section you should be able to
In standard cases, calculate the p-value corresponding to a given alternative hypothesis and a given observed value of the test statistic.
In standard cases, identify the form of the critical region for a test with a given significance level for each of the standard types of alternative hypothesis.
In standard cases, calculate the probability of a type II error for a test with a given significance level.
In standard cases, calculate the power against a given simple alternative hypothesis for a test with a given significance level.

Section 7.1: Introduction
Definition 7.1. A hypothesis is a statement about a parameter – e.g. µ = 4.2 or 2 < σ < 5.
Establishing consistency with the data is posed as a competition between the null hypothesis H0 and the alternative hypothesis H1, although the two are not treated symmetrically.
We consider whether H0 is consistent with the data x1, x2, ..., xn (or if a value of θ allowed by H1 is preferable).
Ask: "Is the sample data an unlikely thing to observe under H0?".
H1 is mainly present to define the direction of departures from H0 that are regarded as interesting.
For example, if testing whether broadband speeds are at least a specified amount µ0, we would test H0: µ = µ0 against the alternative H1: µ < µ0, since the consumer is happy to get too much!

Section 7.2: Hypothesis-testing procedure
Remark 7.2. Hypothesis testing is not the same as deciding whether H0 is true or not. Data will be consistent with two or more hypotheses that contradict each other!
Definition 7.3. At its simplest, a hypothesis-testing procedure requires the following steps:
1. Statement of any model assumptions,
2. Statement of the null hypothesis H0 and the alternative hypothesis H1 of interest,
3. Calculation of the value of an appropriate test statistic,
4. Computation of the resulting p-value,
5. Report on any conclusions.

Stage 1: Model Assumptions
As with any statistical procedure, we start with a probability model for the data. We will assume that the data is a simple random sample from a particular distribution in a known parametric family. We will first focus on the case when the parameter of interest is the population mean µ.
Example 7.4. As in Example 6.1, we make the following assumptions:
(a) x1, ..., xn are the observed values of a random sample X1, ..., Xn, ...
(b) ... from a population with the N(µ, σ²) distribution, where µ is unknown but the value of σ² is known – say σ² = σ0².
Stage 2: Null Hypothesis H0
Often the null hypothesis is one of 'no difference' or 'no effect', e.g. the hypothesis that the current population looks like some previous reference population.
Hence parameter values are similar for the current population and the reference population. That is why we call it the null hypothesis.
Denote the known mean for the previous population by µ0 and the unknown mean for the current population by µ; then the null hypothesis takes the form H0: µ = µ0.

Stage 2: Alternative Hypothesis H1
We write the null hypothesis as H0: µ = µ0. We restrict attention to three standard cases for H1:
(a) the current mean is greater than its previous value, i.e. H1: µ > µ0
(b) the current mean is less than its previous value, i.e. H1: µ < µ0
(c) the current mean differs from its previous value, i.e. H1: µ ≠ µ0.
Note that for large sample sizes, a small difference between the current parameter and the reference parameter may be statistically significant but not of practical importance.
Example 7.4. The reference value of the mean is some known value µ0 and we are interested in whether the data leads us to conclude the population has mean µ > µ0.
Null hypothesis is H0: µ = µ0 (no difference between the means).
Alternative hypothesis is H1: µ > µ0 (new mean being greater).

Stage 3: Test Statistic
To summarise the evidence that the data provides (for or against H0) we use the value of a suitable test statistic T(X1, ..., Xn), i.e. a function of the data with the following properties:
(a) extreme values of the test statistic would be highly unlikely if H0 were true and suggest that H0 is in fact false,
(b) when µ = µ0 (i.e. when H0 is true) the distribution of T is known and its distribution function can be easily calculated.
We write tobs for the observed value of the test statistic.

Stage 3: Test Statistic
Example 7.4. Since X̄ is the natural estimator of µ, we base our test statistic on X̄ − µ0. Since the population standard deviation σ0 is assumed known, we can take as our test statistic
T(X1, ..., Xn) = √n (X̄ − µ0)/σ0.
Then from Theorem 5.9(ii), when H0 is true (i.e. when µ = µ0) we have X̄ ∼ N(µ0, σ0²/n) and T ∼ N(0, 1).
We write the observed value of the test statistic as
tobs = √n (x̄ − µ0)/σ0.

Stage 4: p-value approach: Consistency with H0
If tobs = T(x1, ..., xn) is relatively consistent with H0 then it provides little or no reason to believe that H0 is untrue.
Thus, given an observed value t, we'd like to know the values of the test statistic T which would be less consistent with H0 and more consistent with H1 than t. Obviously, this set of values depends on the particular form of H1.
Statistics 2 explains these are the set of values whose relative likelihood of occurring under H0 rather than H1 is less than that for t.

Stage 4: p-value
Definition 7.5. Compute the probability (assuming H0 is true) of a test statistic value less consistent with H0 (and more consistent with H1) than tobs. We call this probability the p-value corresponding to tobs.
For each alternative, we calculate the p-value as follows:
(a) H1: µ > µ0 ⇒ p-value = P(T ≥ tobs | H0 true)
(b) H1: µ < µ0 ⇒ p-value = P(T ≤ tobs | H0 true)
(c) H1: µ ≠ µ0 ⇒ p-value = P(|T| ≥ |tobs| | H0 true).
Small p-values make us more sceptical of H0.
Example 7.4. Since the alternative of interest is H1: µ > µ0, the values of T which are less consistent with H0 than tobs are the set {T ≥ tobs}. Also, when H0 is true, T ∼ Z ∼ N(0, 1), so
p-value = P(T ≥ tobs | H0 true) = P(Z ≥ tobs) = 1 − Φ(tobs).
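The three p-value recipes of Definition 7.5 differ only in which tail(s) of the null distribution are used. As a sketch (not part of the original notes), here they are for the known-variance z statistic, using scipy's norm.sf for the upper tail 1 − Φ(t) (the R analogue is 1 - pnorm(t)); the numbers x̄ = 10.5, µ0 = 10, σ0 = 2, n = 25 are purely illustrative.

```python
# Sketch, not from the notes: p-values for T = sqrt(n)(xbar - mu0)/sigma0
# under the three standard alternatives of Definition 7.5.
from math import sqrt
from scipy.stats import norm

def z_pvalue(xbar, mu0, sigma0, n, alternative):
    t = sqrt(n) * (xbar - mu0) / sigma0
    if alternative == "greater":        # (a) H1: mu > mu0, upper tail
        return norm.sf(t)
    if alternative == "less":           # (b) H1: mu < mu0, lower tail
        return norm.cdf(t)
    return 2 * norm.sf(abs(t))          # (c) H1: mu != mu0, both tails

# Illustrative numbers (not a dataset from the course):
p_two = z_pvalue(10.5, 10, 2, 25, "two.sided")
print(p_two)
```

Note that the two-sided p-value is exactly twice the smaller of the two one-sided p-values, by the symmetry of N(0, 1).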
Stage 5: Conclusions – Interpretation of the p-value
If the p-value is very small – the level of consistency with H0 is very small – we have reason to believe that either the null hypothesis H0: µ = µ0 is false, or that something very unlikely has happened. Thus, small p-values may well lead us to reject H0 in favour of H1.
Conversely, if the p-value is relatively large, then these observations are relatively likely to occur when H0 is true, and we conclude that there is no reason to reject H0.
In giving conclusions we should
(a) report the p-value,
(b) interpret it to make practical conclusions about µ in the context of the example (i.e. does the test show that the sample data gives a reason to reject H0 in favour of H1 or not?).

Section 7.3: Example - Normal distribution with known variance
Example 7.6. When patients with a certain chronic illness are treated with the current standard medication, the mean time to recurrence of the illness is µ0 = 53.3 days, with a standard deviation of σ0 = 26.4 days.
A new type of medication, that is thought to increase the time until recurrence, was tried by a randomly chosen sample of n = 16 patients. For this sample, the mean time to recurrence was x̄ = 65.8 days.
Assuming the variance of the recovery time is the same for the new and current medication, we want to test whether the new medication has increased the mean time to recurrence.

Stages 1. and 2. Model assumptions and hypotheses
Example 7.6.
(a) The recurrence times for the n = 16 patients are a random sample from the population of recurrence times for all patients that will use this new medication ...
(b) ... with distribution N(µ, σ²), where µ is unknown but the value σ² = σ0² = (26.4)² is known.
We take H0: µ = µ0 = 53.3 versus H1: µ > 53.3.
The null hypothesis H0 corresponds to no difference between the mean recurrence time µ for the new medication and the mean recurrence time µ0 = 53.3 for the standard medication.
The alternative hypothesis H1 corresponds to the mean recurrence time for the new medication being longer than for the standard medication.

Stage 3. Test Statistic
Example 7.6.
We base our test statistic on X̄ − 53.3. Since the population standard deviation σ0 = 26.4 is assumed known and n = 16, we can take as our test statistic
T(X1, ..., Xn) = √n (X̄ − µ0)/σ0 = √16 (X̄ − 53.3)/26.4,
where X̄ ∼ N(µ, σ0²/n) = N(µ, (26.4)²/16).
Thus, when H0 is true (i.e. when µ = µ0 = 53.3) we have T = √16 (X̄ − 53.3)/26.4 ∼ N(0, 1).
The data gives x̄ = 65.8, so the observed test statistic is tobs = √16 (65.8 − 53.3)/26.4 = 1.893.

Stages 4. and 5. p-value and Conclusions
Example 7.6. The values of T which are less consistent with H0 than tobs are the set {T ≥ tobs = 1.893}, so
p-value = P(T ≥ tobs | H0 true) = P(Z ≥ 1.893) = 1 − Φ(1.893) = 0.0292.
The p-value of 0.0292 is quite small – if the mean for the new medication was really 53.3, we would only observe data for which the consistency with H0 was this small about 3 percent of the time. Thus there is a reasonably strong case that H0 is not true.

Section 7.4: One sample t-test: µ, σ² unknown
Examples 7.4 and 7.6 correspond to Example 6.1 – that is, we make the unrealistic assumption that σ² is known. We now give an equivalent to Example 6.8 – that is, testing hypotheses about µ for σ² unknown.
Example 7.7.
Assume X1, ..., Xn form a simple random sample from N(µ, σ²) where µ and σ² are unknown.
Null hypothesis: H0: µ = µ0.
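Stages 3-5 of Example 7.6 take two lines to reproduce. The sketch below (not part of the original notes) uses scipy's norm.sf in place of R's 1 - pnorm.

```python
# Sketch, not from the notes: the z-test of Example 7.6
# (mu0 = 53.3, sigma0 = 26.4, n = 16, xbar = 65.8) against H1: mu > mu0.
from math import sqrt
from scipy.stats import norm

mu0, sigma0, n, xbar = 53.3, 26.4, 16, 65.8
t_obs = sqrt(n) * (xbar - mu0) / sigma0     # about 1.893
p_value = norm.sf(t_obs)                    # P(Z >= t_obs), about 0.029
print(t_obs, p_value)
```

The p-value of about 0.029 matches the slide's 1 − Φ(1.893) = 0.0292 up to rounding of tobs.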
Alternative hypothesis: one of the standard cases, either H1: µ > µ0, H1: µ < µ0 or H1: µ ≠ µ0. For definiteness (and variety) take H1: µ ≠ µ0.

Example 7.7. Theorem 5.18 implies that when H0 is true (i.e. when µ = µ0)
T = (X̄ − µ0)/(S/√n) ∼ t_{n−1},
where X̄ = (X1 + ... + Xn)/n and S² = Σ_{j=1}^n (Xj − X̄)²/(n−1).
Because we take H1: µ ≠ µ0, we are interested in values of T such that {|T| ≥ |tobs|} (two-sided alternative). Hence
p-value = P(|T| ≥ |tobs| | H0 true) = P(|t_{n−1}| ≥ |tobs|).
If the p-value is large, there is no reason to reject H0. If the p-value is small, this suggests inconsistency with H0.

Example of 1-sample t-test
Example 7.8. To investigate the accuracy of DIY radon detectors, researchers bought 12 such detectors and exposed them to exactly 105 picocuries per litre of radon.
The 12 detector readings were:
91.9, 97.8, 111.4, 122.3, 105.4, 95.0, 103.8, 99.6, 96.6, 119.3, 104.8, 101.7.
This gives summary statistics Σ_i xi = 1249.6, x̄ = 104.1333, n = 12, Σ_i xi² = 131096.44, S² = 88.3115.
Our question: does the mean for such detectors seem to differ from 105?

Example of 1-sample t-test
Example 7.8. Assume these 12 observations are a random sample from N(µ, σ²) where both µ and σ² are unknown.
H0: µ = 105 vs. H1: µ ≠ 105 (2-sided alternative).
T = (X̄ − µ0)/(S/√n) ∼ t_{n−1} when H0 is true.
We have n = 12 and µ0 = 105, so the observed value of T is
tobs = √12 (104.1333 − 105)/√88.3115 = −0.32, and |tobs| = 0.32.
The p-value is P(|T| ≥ 0.32) when T ∼ t_{n−1} = t_{11}.
Using R, we find pt(0.32,11)=0.6225, so P(|T| ≥ 0.32) = 2(1 − 0.6225) = 0.755.
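The whole of Example 7.8 is a single call in practice. As a cross-check (not part of the original notes), scipy's ttest_1samp computes the two-sided test directly from the raw readings; the R equivalent on the slides' conventions would be t.test(x, mu = 105).

```python
# Sketch, not from the notes: the one-sample t-test of Example 7.8.
from scipy.stats import ttest_1samp

readings = [91.9, 97.8, 111.4, 122.3, 105.4, 95.0, 103.8,
            99.6, 96.6, 119.3, 104.8, 101.7]
res = ttest_1samp(readings, popmean=105)    # H0: mu = 105, two-sided by default
print(res.statistic, res.pvalue)            # t about -0.32, p about 0.755
```

The large p-value (about 0.755) reproduces the slide's conclusion: no evidence that the detectors' mean reading differs from 105.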
Section 7.5: Alternative approach: Critical region
An alternative approach (avoiding p-value calculations) is to define a critical region.
In advance of the test, we define a threshold for the test statistic which, if crossed, will lead us to reject the null hypothesis. Fixing (and publicising?) this value in advance can encourage scientific honesty.
Thinking in terms of critical regions is useful philosophically, e.g. when thinking about optimal test procedures (see courses in later years). In this context, we introduce the useful ideas of Type I and Type II error.
The outcome of such tests is less informative than a p-value, but easier to calculate without R, so this approach is often found in earlier textbooks.
Critical regions provide an alternative for Stage 4 - the other stages of the procedure are unchanged.

Type I and Type II error
One way of evaluating the performance of a test procedure is to focus on some simple fixed alternative hypothesis H1: µ = µ1, and assume that µ can only take one of the two values µ = µ0 or µ = µ1. In this simplified context, there are only two possible errors.
Definition 7.9.
A type I error is the error of deciding the null hypothesis H0 is false when H0 is actually true.
A type II error is the error of deciding the null hypothesis H0 is true when H1 is actually true.
There is a trade-off between type I and type II error. A change to the test procedure that reduces the type I error will usually increase the type II error, and vice-versa.
Control of the type I error is often thought of as being in some sense more important, since H0 represents the status-quo.

Significance level
Definition 7.10. Often we fix in advance some small acceptable threshold level (e.g. α = 0.05 or 0.01) for the type I error.
We call this the significance level of the test, and speak of an α-level test.
Fixing the significance level α in turn fixes the critical region C (the set of values of T that would lead us to reject H0) and the critical value c∗.
For significance level α and the alternative H1: µ > µ0, the critical value c∗ satisfies:
P(T ≥ c∗ | H0 true) = P(Reject H0 | H0 true) = P(Type I error) = α.
Corresponding conditions hold for the other two alternative hypotheses.
Remark 7.11. By definition, an α-level test procedure will reject H0 if and only if the calculated p-value is less than or equal to α.
e.g. P(T ≥ tobs | H0 true) ≤ α = P(T ≥ c∗ | H0 true) iff tobs ≥ c∗.
Hence reporting only whether T ∈ C is less informative than the p-value.

Return to Example 7.4
Example 7.4. Since H1: µ > µ0, the values {T ≥ t} are those which are less consistent with H0 than t. Equivalently, the critical region of values for which the test would reject H0 is of the form C = {T ≥ c∗}.
To find c∗ for a given significance level α, we recall that a test has significance level α if P(Reject H0 | H0 true) = α. Thus, for an α-level test, c∗ is defined by the condition
α = P(Reject H0 | H0 true) = P(T ≥ c∗ | H0 true) = P(Z ≥ c∗) = 1 − Φ(c∗).
Hence c∗ = Φ⁻¹(1 − α). For α = 0.05, c∗ = Φ⁻¹(0.95) = 1.645.
We reject H0 at the 0.05 significance level if T ≥ c∗ = 1.645, or equivalently if X̄ ≥ µ0 + 1.645 σ0/√n (since T = √n (X̄ − µ0)/σ0).

Further numerical examples
Example 7.12.
Example 7.6: the critical region of values C has the form C = {T ≥ c∗}. Since T ∼ N(0, 1) under H0, and P(N(0, 1) ≥ 1.645) = 0.05, we take as critical region C = {T ≥ 1.645}.
Since tobs = 1.893 is in C, the 0.05-level test would lead us to reject H0.
Example 7.7: the form of the alternative hypothesis implies we are looking for a critical region of the form C = {|T| ≥ c∗}. For an α-level test, c∗ is defined by α = P(|T| ≥ c∗ | H0 true) = P(|t_{n−1}| ≥ c∗). The relevant c∗ is found in R as qt(1-alpha/2,n-1).
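The two critical values used in Example 7.12 come straight from the relevant quantile functions. As a sketch (not part of the original notes), scipy's norm.ppf and t.ppf stand in for R's qnorm and qt(1-alpha/2, n-1).

```python
# Sketch, not from the notes: critical values for 0.05-level tests.
from scipy.stats import norm, t

alpha = 0.05
c_one_sided = norm.ppf(1 - alpha)           # C = {T >= c*}: c* = 1.645
c_two_sided_t11 = t.ppf(1 - alpha / 2, 11)  # C = {|T| >= c*} for t_11: 2.201
print(c_one_sided, c_two_sided_t11)
```

These are the thresholds used to reject in Example 7.6 (tobs = 1.893 ≥ 1.645) and to not reject in Example 7.8 (|tobs| = 0.32 < 2.201).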
As before, tobs in the critical region means we reject H0 (at significance level α).
Example 7.8: For an α-level test we use the critical region C = {|T| ≥ t_{11;α/2}}. Take the significance level α = 0.05. Using R gives qt(0.975,11)=2.201. The value |tobs| = 0.32 is not within C = {|T| ≥ 2.201}, so we do not reject H0 at the 5% level.

P(Type II error) and Power
Definition 7.13. For a fixed alternative hypothesis H1: µ = µ1, we define the power of the test to be 1 − P(Type II error). It gives a measure of how powerful the procedure would be in detecting that the alternative H1 is true.
Example 7.14. Consider a simple fixed alternative of the form H1: µ = µ1, with µ1 > µ0. Under our test procedure for this alternative, we accept H0 as true if and only if T < c∗.
Note that P(Type II error) = P(Accept H0 | H1 true) = P(T < c∗ | µ = µ1).
Hence the power is 1 − P(Type II error) = P(T ≥ c∗ | µ = µ1).

t-tests in R
Example 7.15. The R command:
> t.test(data, mu = 0, alternative="greater", conf.level=0.9)
will compute a one-sample t-test on the observations in an array data, with null hypothesis H0: µ = 0, alternative hypothesis H1: µ > 0, and significance level α = 0.1.
The numerical mean value 0 can be replaced by the value appropriate for your data, the alternative hypothesis "greater" can be replaced by the alternatives "less" or "two.sided" as desired, and the significance level can be changed by setting conf.level to 1 − α (the default value is α = 0.05).

Section 7.6: Confidence Intervals and Hypothesis Tests
Theorem 7.16. Hypothesis tests are closely related to confidence intervals.
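For the known-variance test of Example 7.4, the power of Example 7.14 has a closed form: under µ = µ1 the statistic T = √n(X̄ − µ0)/σ0 is N(√n(µ1 − µ0)/σ0, 1), so the power is the shifted upper tail P(T ≥ c∗ | µ = µ1). The sketch below (not part of the original notes) computes this; taking µ1 = 65.8 with the Example 7.6 parameters is a hypothetical illustration, not a calculation from the slides.

```python
# Sketch, not from the notes: power of the alpha-level test of H0: mu = mu0
# against the fixed alternative H1: mu = mu1 > mu0, known sigma0.
from math import sqrt
from scipy.stats import norm

def power(mu0, mu1, sigma0, n, alpha=0.05):
    c_star = norm.ppf(1 - alpha)            # critical value for H1: mu > mu0
    shift = sqrt(n) * (mu1 - mu0) / sigma0  # mean of T under mu = mu1
    return norm.sf(c_star - shift)          # P(T >= c* | mu = mu1)

# Hypothetical illustration with the Example 7.6 parameters:
pw = power(mu0=53.3, mu1=65.8, sigma0=26.4, n=16)
print(pw)
```

When µ1 = µ0 the power reduces to α, and it increases towards 1 as n or the separation µ1 − µ0 grows, mirroring the trade-offs described above.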
In particular, the α-level test of H0 : µ = µ0 versus the two-sided alternative H1 : µ ≠ µ0 will reject H0 if and only if the corresponding two-sided 100(1 − α)% confidence interval for µ does not contain µ0. Confidence Intervals and Hypothesis Tests Example 7.17. If we know σ² = σ0², then we reject H0 if and only if c∗ ≤ |T| = |√n(X̄ − µ0)/σ0|, where α/2 = 1 − Φ(c∗), i.e. c∗ = zα/2. Rearranging, this means that we reject if and only if µ0 ∉ (X̄ − zα/2 σ0/√n, X̄ + zα/2 σ0/√n), which we recognise from Example 6.6 as the two-sided 100(1 − α)% confidence interval. It is a good exercise to check that a similar equivalence holds when σ² is unknown. Similar results connect one-sided tests and one-sided confidence intervals of the form (−∞, cU) or (cL, ∞). Section 8: Comparison of population means Aims of this section: In the last section we introduced hypothesis tests in the context of samples from a single distribution. In this section we will use hypothesis tests to compare samples from distributions that differ in some qualitative factor. In particular, we will investigate situations where it is thought that the change in the qualitative factor may have had the effect of increasing or decreasing the population mean. Suggested reading: Rice, Sections 11.1–11.3. Objectives: by the end of this section you should be able to Identify from an informal problem description situations where a paired t-test or a two sample t-test would be appropriate, and, in the latter case, identify whether or not a pooled two sample t-test would be appropriate. State appropriate model assumptions and formulate appropriate null and alternative hypotheses for each type of two-sample or paired t-test listed above.
Identify an appropriate test statistic and its distribution under the null hypothesis for each type of two-sample or paired t-test listed above. Use the methods of this section to compute appropriate p-values or critical regions and report appropriate conclusions for each type of two-sample or paired t-test listed above. Use appropriate commands in R to perform each of the types of two-sample or paired t-test listed above, and correctly interpret and report the output of the procedures. Section 8.1: Introduction So far, we have modelled data as a random sample from a fixed parametric distribution. This model applies to experimental units of the same type, where differences in the data values come from random variation. In reality, data are usually collected in order to compare groups, and/or to study how the main variable of interest (the response variable) depends on one or more explanatory variables. These are really versions of the same question – we can think of each data item being accompanied by a label indicating its group (age groups, different treatments etc.). By studying how the response variable depends on this label, we are comparing groups. In this case, the explanatory variable is discrete or categorical, and often referred to as a factor. In Sections 9 and 10 we consider regression, where there is a single numerical explanatory variable. Section 8.2: Comparison of two groups In this section, we suppose we have two distinct groups of data (so the explanatory variable is categorical with two levels), and suspect there are systematic differences between the groups. The groups might be defined by properties of the experimental units (human subjects, etc.) or by different treatments (e.g. drug therapies) applied to them.
We want to know if there are systematic differences in some quantity of interest between the populations (corresponding to differences in the factor). The response variable will be influenced by both systematic and random variation. The statistician's task is to separate these two effects. For simplicity, we only consider the case of normally distributed data with the quantity of interest being the population mean. We test whether observed differences are statistically significant. Case 1: Independent samples Sometimes, we can assume each data set is entirely independent of the other. In this case the data can be modelled as two independent random samples from different population distributions. Here the question of interest reduces to whether the means of the two populations differ. This type of model can be analysed using a two sample t-test (see Section 8.3). Case 2: Paired (or matched) samples Alternatively, the data consist of pairs of observations on each of n experimental units, with different treatments applied to each member of the pair. The first observation in each pair corresponds to one factor value and the second corresponds to the other. We may assume that the change in factor value is associated with a common systematic change in the underlying distribution of the variable being measured. An appropriate model is often that the differences between observations in each pair are independent observations from the same distribution, whose mean corresponds to the systematic change. The question of interest reduces to whether this mean change is zero. This model can be analysed by a paired t-test (see Section 8.4). Section 8.3: Two sample t-test Example 8.1. Model assumptions are that there are two independent samples. The X1 , . . .
, Xn is a random sample of size n from the N(µX, σX²) distribution, with sample mean X̄ = (X1 + · · · + Xn)/n. Y1, . . . , Ym is a random sample of size m from the N(µY, σY²) distribution, with sample mean Ȳ = (Y1 + · · · + Ym)/m. The null hypothesis of interest is H0 : µX − µY = 0. The standard estimators of µX and µY are X̄ and Ȳ, so it is natural to base our analysis on the value of X̄ − Ȳ. From Theorem , X̄ ∼ N(µX, σX²/n) and Ȳ ∼ N(µY, σY²/m), so from Lemma , X̄ − Ȳ ∼ N(µX − µY, σX²/n + σY²/m). Moreover, when H0 is true, µX − µY = 0, and Lemma gives (X̄ − Ȳ)/√(σX²/n + σY²/m) ∼ N(0, 1). (8.1) Example 8.1. Since σX² and σY² are unknown, we replace them by appropriate estimates and use as test statistic T = (X̄ − Ȳ)/√(σ̂X²/n + σ̂Y²/m). We reject H0 if the value of the test statistic is significantly different from zero, where the relevant direction depends on H1. The situation is slightly more complicated than the single sample case, where the resulting test statistic has a standard t-distribution: 1. If we can assume the X and Y distributions have the same variance, we combine the estimates σ̂X² and σ̂Y² into a single pooled estimate Sp², and the resulting test statistic has a t-distribution (see Example 8.2). 2. If we cannot make this assumption, then a result due to Welch shows that the distribution of the test statistic can be approximated by a t-distribution with non-integer degrees of freedom (see Example 8.3). Pooled two sample t-test Example 8.2 (Pooled two sample t-test). Here we are prepared to make the extra model assumption that σX² = σY² = (say) σ². Denote the sample variances by SX² = Σi=1..n (Xi − X̄)²/(n − 1) and SY² = Σj=1..m (Yj − Ȳ)²/(m − 1).
Since both of these are independent estimates of the common variance σ², we can combine them into the pooled estimate Sp² = ((n − 1)SX² + (m − 1)SY²)/(n + m − 2) = (Σi=1..n (Xi − X̄)² + Σj=1..m (Yj − Ȳ)²)/(n + m − 2). The test statistic then becomes T = (X̄ − Ȳ)/(Sp √(1/n + 1/m)), where T ∼ tn+m−2 when H0 is true. Example 8.2. To see that T does have the claimed distribution, note that we can write T = U/√(Sp²/σ²), where U = (X̄ − Ȳ)/(σ √(1/n + 1/m)). From (8.1) and since σX² = σY² = σ², U ∼ N(0, 1) when H0 is true. Further, from Theorem we have that (independently): Σi=1..n (Xi − X̄)²/σX² ∼ χ²n−1 and Σj=1..m (Yj − Ȳ)²/σY² ∼ χ²m−1. Since the samples are independent and σX² = σY² = σ², Lemma 5.14 gives (n + m − 2)Sp²/σ² = (Σi=1..n (Xi − X̄)² + Σj=1..m (Yj − Ȳ)²)/σ² ∼ χ²n+m−2. Thus, from Definition 5.16, T ∼ tn+m−2 when H0 is true. Welch two sample t-test Example 8.3 (Welch two sample t-test). In the general case, when σX² and σY² are not assumed equal, the natural estimators of the population variances are the corresponding sample variances. Put σ̂X² = SX² = Σi=1..n (Xi − X̄)²/(n − 1) and σ̂Y² = SY² = Σj=1..m (Yj − Ȳ)²/(m − 1). The test statistic is then: T = (X̄ − Ȳ)/√(σ̂X²/n + σ̂Y²/m). Example 8.3. A result due to Welch shows that T ≈ tν when H0 is true. The degrees of freedom are computed as: ν = (SX²/n + SY²/m)² / ( (SX²/n)²/(n − 1) + (SY²/m)²/(m − 1) ). Note that, when the X and Y distributions have similar unimodal shapes, the approximation to the distribution of the test statistic is reasonably good. Note also that, when the sample sizes m, n and sample variances SX², SY² are similar for the two samples, then the degrees of freedom ν will be close to the value n + m − 2 used in the pooled test.
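The Welch degrees-of-freedom formula is easy to mis-transcribe, so here is a minimal sketch of it as a function (in Python rather than the course's R), checking both the closing remark and the cider-maker values used in Example 8.4 below:

```python
# Sketch (Python rather than the course's R): the Welch degrees-of-freedom
# formula of Example 8.3 written out as a plain function.

def welch_df(sx2: float, n: int, sy2: float, m: int) -> float:
    """Welch approximate degrees of freedom from sample variances and sizes."""
    vx, vy = sx2 / n, sy2 / m
    return (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))

# When the sample sizes and sample variances agree exactly, nu equals
# n + m - 2, matching the remark about closeness to the pooled test.
print(welch_df(4.0, 10, 4.0, 10))  # 18.0 = 10 + 10 - 2

# With the summary statistics of Example 8.4 below
# (sx^2 = 10.04855, n = 11; sy^2 = 6.660833, m = 12):
print(round(welch_df(10.04855, 11, 6.660833, 12), 2))  # 19.35
```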
Cider maker: two sample t-test Example 8.4. A cider maker tests if an insecticide increases the total crop per tree. From 23 trees, he chooses 11 at random and sprays them, leaving the remaining 12 unsprayed, and otherwise treats them identically. Write X1, . . . , X11 for the yields from the 11 sprayed trees. Write Y1, . . . , Y12 for the yields from the 12 unsprayed trees. These yields (in kg) are available in the files apple.sprayed and apple.unsprayed in Stats1.RData. The summary statistics are: n = 11, x̄ = 42.56, sx² = 10.04855, m = 12, ȳ = 40.49, sy² = 6.660833, sp² = 8.27403. Assume two independent samples: X1, . . . , Xn is a random sample of size n from the N(µX, σX²) distribution and Y1, . . . , Ym is a random sample of size m from the N(µY, σY²) distribution. Compare H0 : µX − µY = 0 vs H1 : µX − µY > 0. Cider maker: pooled two sample t-test Example 8.4. If we assume σX² = σY² = σ², as in Example 8.2, we can use Sp². Substituting the values above, we obtain n + m − 2 = 21 and tobs = (x̄ − ȳ)/(sp √(1/n + 1/m)) = 1.7256. The form of H1 implies that we are interested in values of T ≥ tobs. Since R gives pt(1.7256,21) = 0.95045, the p-value = P(T ≥ tobs | H0 true) = P(t21 ≥ 1.7256) = 1 − 0.95045 ≈ 0.05. For the critical region: we need α = P(T ≥ c∗ | H0 true), so c∗ = t21,α. E.g. for α = 0.05, using qt(0.95,21) in R gives c∗ = t21,0.05 = 1.7207, so C = {T ≥ 1.7207}. Hence the p-value is close to 0.05 and tobs is close to the edge of the critical region: some evidence that spraying increases the mean. Cider maker: Welch t-test Example 8.4. If we do not assume σX² = σY², as in Example 8.3, we can use the Welch approximation. In this case tobs = (x̄ − ȳ)/√(sx²/n + sy²/m) = 1.7098.
The degrees of freedom are computed as: ν = (sx²/n + sy²/m)² / ( (sx²/n)²/(n − 1) + (sy²/m)²/(m − 1) ) = 19.35. Cider maker: Welch t-test Example 8.4. The form of H1 implies that we are interested in values of T ≥ tobs. R gives pt(1.7098,19.35) = 0.94835, so the p-value = P(T ≥ tobs | H0 true) = P(t19.35 ≥ 1.7098) = 1 − 0.94835 ≈ 0.052. For the critical region: we need α = P(T ≥ c∗ | H0 true), so c∗ = t19.35,α. E.g. for α = 0.05, using qt(0.95,19.35) in R gives c∗ = t19.35,0.05 = 1.7275, so C = {T ≥ 1.7275}. Hence the p-value is close to 0.05 and tobs is close to the edge of the critical region: some evidence that spraying increases the mean. Section 8.4: Paired t-test Example 8.5. In certain circumstances (e.g. twin studies in medicine), we may have reason to pair up the sample data values as (X1, Y1), . . . , (Xn, Yn). Denote the difference between the values in each pair by Wi = Xi − Yi. The model assumption is then that W1, . . . , Wn are a random sample from the N(δ, σ²) distribution, where δ and σ² are unknown. The null hypothesis is H0 : δ = 0. The analysis follows as for the one sample t-test in Example . Put W̄ = (W1 + · · · + Wn)/n and σ̂W² = SW² = Σi=1..n (Wi − W̄)²/(n − 1). The test statistic is then T = √n W̄/σ̂W, where T ∼ tn−1 when H0 is true. We reject H0 if the value of the test statistic is significantly different from zero, where the direction of the difference depends on H1. Remark 8.6. Note that the model assumptions do not necessarily require that X1, . . . , Xn all have the same distribution. For example, suppose that each Xi ∼ N(µi + δ, τ²) and that each Yi ∼ N(µi, τ²), where the underlying mean values µi may all be different.
Provided this systematic difference is the same for every pair, this would still be consistent with the model assumptions above, since it still implies that each Xi − Yi ∼ N(δ, σ²), where σ² = 2τ². This type of experimental design may be particularly appropriate if the experimental units are quite variable. The following example shows that small but consistent systematic differences may show up in an experiment that uses paired observations, but not in one using two independent samples, if the small differences in mean are masked by the high variability between the experimental units. Example: paired t-test Example 8.7. To test two water-repellents, 5 garments of different materials were cut in half. One half was treated with repellent A, the other with repellent B. They were placed in a wet environment, and the amount of water absorbed in grams was as follows:
Garment         1    2    3     4    5
Treatment A Xi  1.7  4.3  14.6  5.0  2.2
Treatment B Yi  1.4  3.9  14.2  4.2  2.0
Differences Wi  0.3  0.4  0.4   0.8  0.2
We assume that W1, . . . , W5 is a random sample from N(δ, σ²), where both parameters are unknown. We test the hypothesis H0 : δ = 0 (no systematic difference) against H1 : δ ≠ 0 (some difference). Example: paired t-test Example 8.7. Put W̄ = (W1 + · · · + Wn)/n and SW² = Σi=1..n (Wi − W̄)²/(n − 1). For the given data, w̄ = 0.42 and sw² = 0.052. The observed test statistic is tobs = √5 w̄/√(sw²) = 4.118. The test statistic T ∼ t4 when H0 is true. The alternative is two-sided, so we are interested in the set {|T| ≥ tobs}. In terms of the p-value, we need P(|t4| ≥ 4.118) = 2(1 − P(t4 ≤ 4.118)). pt(4.118,4) = 0.9926, so we deduce P(|t4| ≥ 4.118) = 0.015. In terms of the critical region, we need α = P(|t4| ≥ c∗) = 2P(t4 ≥ c∗). Hence for α = 0.05, we need c∗ = t4;0.025 = 2.776 and C = {|T| ≥ 2.776}.
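The paired-test arithmetic above can be checked directly from the garment table (a sketch in Python rather than the course's R; the data values are those in the table):

```python
# Sketch (Python rather than the course's R): verifying the paired t-test
# arithmetic of Example 8.7 from the garment data above.
from math import sqrt

x = [1.7, 4.3, 14.6, 5.0, 2.2]  # treatment A
y = [1.4, 3.9, 14.2, 4.2, 2.0]  # treatment B
w = [xi - yi for xi, yi in zip(x, y)]  # paired differences W_i

n = len(w)
w_bar = sum(w) / n
s2_w = sum((wi - w_bar) ** 2 for wi in w) / (n - 1)  # sample variance S_W^2
t_obs = sqrt(n) * w_bar / sqrt(s2_w)

print(round(w_bar, 2), round(s2_w, 3), round(t_obs, 3))  # 0.42 0.052 4.118
```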
Either approach shows there is good evidence to reject H0 – there is evidence of a significant difference between the two treatments. Two-sample and paired t-test Remark 8.8. The Xi weights are not independent of the Yi weights. The variability in the Xi values is very large relative to the variability in the Wi. If you compare the two datasets in the water repellent example above (Example 8.7) using a non-paired t-test, you will not find anything significant. For example, using t.test in R (see below) gives a p-value of 0.901. However, the paired test compares the differences Wi, and isolates the treatment effect. Section 8.5: t-test procedures in R Assume the two random samples are in data arrays xdata and ydata. A test of the null hypothesis H0 : µX − µY = 0 against the two sided alternative H1 : µX − µY ≠ 0 can be performed using the command > t.test(xdata,ydata) The output includes the value of the test statistic, the degrees of freedom ν for the approximating t-distribution and the (approximate) p-value. The option alternative="less" can be used to test against the alternative H1 : µX − µY < 0, as in the command > t.test(xdata,ydata,alternative="less") Similarly the option alternative="greater" can be used to test against the alternative H1 : µX − µY > 0. Specific forms of the two sample t-test The default two sample t-test in R is the Welch test.
Under the model assumption that the population variances are equal, a pooled t-test of the null hypothesis H0 : µX − µY = 0 against the two sided alternative H1 : µX − µY ≠ 0 can be performed using the command > t.test(xdata,ydata,var.equal=T) For the paired t-test, the data is assumed to be in equal-length data arrays xdata and ydata, where each component of xdata will be paired with the corresponding component of ydata. A paired t-test of the null hypothesis H0 : δ = 0 against the two sided alternative H1 : δ ≠ 0 can then be performed using the command > t.test(xdata,ydata,paired=T) Section 9: Linear regression Aims of this section: In this section we provide a brief introduction to the ideas and methods of simple linear regression. Suggested reading: Rice, Sections 14.1 and 14.2. Objectives: by the end of this section you should be able to State the model assumptions under which a simple linear regression model is appropriate for describing and analysing a set of data consisting of predictor values and corresponding responses. Produce a scatter plot of response values against predictor values, both by hand and in R. Compute least squares estimates of the slope and intercept of the fitted regression line, by hand and in R, and add the line to a scatter plot of the data. Comment critically on any deviations from the assumptions of the model that are apparent from the plot of the data values together with the fitted regression line. Compute the fitted values and the residual values, plot the residual values against either the predictor values or the fitted values, and comment critically on any deviations from the assumptions of the model that are apparent from the plot.
Section 9.1: Introduction Instead of discrete groups (as in Section 8), we compare populations of potential Y values for different values of a quantitative variable x. In this course, we only consider one-dimensional x, and a linear dependence of Y on x, but the subject of Regression becomes much more general later – see the (Generalized) Linear Models course. In this notation, the data consist of a set of n pairs of values (x1, y1), . . . , (xn, yn), corresponding to the n members of our sample. Example 9.1. We have data on the heights x and weights y of a sample of students and are interested in how well height can be used to predict weight. Example 9.2. We have data on debt y and parental income x for a sample of students, and wish to investigate whether parental income can explain debt. Types of variables We are interested in whether the associated variable xi can help explain or predict values of the variable yi of interest. The two variables play different roles, in that the variable of interest y does not usually influence the associated variable x. For example, changes in weight do not usually cause changes in height, but a change in height (through growth) is usually associated with an increase in weight. Definition 9.3. For that reason the variable of interest (our Y variable) is called the response variable (an old-fashioned term is the dependent variable). The associated variable (our x variable) is known as the predictor variable or the explanatory variable (formerly the independent variable). Random effects We also need to take account of random variation in the relationship between the x and Y values. For example, if we took repeated samples, then (even if the x values were kept the same) the y values obtained would usually vary from sample to sample.
Thus an appropriate framework is to assume that for each value x of the explanatory variable there is a corresponding population of values of Y with its own x-dependent distribution. We call the function g(x) given by g(x) = E(Y|x) the regression of Y on x. In this framework, we look for a simple expression for g(x) = E(Y|x) which is plausible over an appropriate range of x values. Linear regression model Definition 9.4. The simple linear regression model says the relationship of E(Y|x) to x is of the form E(Y|x) = α + βx. For this model, the basic questions of interest are: What are good estimates of the unknown parameters (assuming the model is correct)? How well do the data fit the model – is there any evidence from the data that the model is not correct? What evidence is there that Y really does depend on x (i.e. that β ≠ 0)? Section 9.2: Model assumptions Definition 9.5. Let x1, . . . , xn be the values observed of a predictor variable X. For each i = 1, . . . , n assume the value yi of the response variable is an observed value of the random variable Yi, where Yi = α + βxi + ei. Here α and β are unknown parameters. The ei are random variables, which we think of as errors. We assume that: E(ei) = 0; Var(ei) = σ² (unknown); Cov(ei, ej) = 0 for i ≠ j (errors are uncorrelated). Equivalent model and summary statistics Lemma 9.6. Under our assumptions, for given x1, . . . , xn, it is equivalent to say that E(Yi) = α + βxi, Var(Yi) = σ², Cov(Yi, Yj) = 0 for i ≠ j (the Yi are uncorrelated). Definition 9.7. To simplify notation, we introduce summary statistics of the data.
Write x̄ = Σi=1..n xi/n, ȳ = Σi=1..n yi/n, ssxx = Σi=1..n (xi − x̄)² = Σi=1..n xi² − n x̄² (cf sample variance), ssyy = Σi=1..n (yi − ȳ)² = Σi=1..n yi² − n ȳ², ssxy = Σi=1..n (xi − x̄)(yi − ȳ) = Σi=1..n xi yi − n x̄ ȳ ('sample covariance'). Section 9.3: Least squares estimates Definition 9.8. For a given model, define the least squares estimates to be the parameter values that minimise Σi=1..n (yi − g(xi))² = Σi=1..n (yi − E(Yi|xi))². That is, we minimise the sums of squares of (vertical) distances between yi and its expected value under the model. For the simple linear regression model, the least squares estimates of α and β are the values α̂ and β̂ minimising Σi=1..n (yi − (α + βxi))². Finding α̂ and β̂ Theorem 9.9. For the simple linear regression model, the least squares estimates are given by β̂ = ssxy/ssxx and α̂ = ȳ − β̂x̄. Proof. We can rewrite the term inside the sum of squares as (yi − (α + βxi)) = ((yi − ȳ) − (α − ȳ + βx̄) − β(xi − x̄)). Finding α̂ and β̂ Proof. Hence the sum of squares satisfies Σi=1..n (yi − (α + βxi))² = Σi=1..n ((yi − ȳ) − (α − ȳ + βx̄) − β(xi − x̄))² = ssyy + n(α − ȳ + βx̄)² + β²ssxx − 2βssxy. (9.1) Here the fact that Σ(xi − x̄) = 0 and Σ(yi − ȳ) = 0 makes some of the cross-terms vanish. Given β̂, we can minimise the second bracket by taking α̂ = ȳ − β̂x̄. This leaves us to choose β to minimise ssyy + β²ssxx − 2βssxy, and differentiating with respect to β, we find 2β̂ssxx = 2ssxy. Section 9.4: Fitted values, residuals and predictions Definition 9.10. 1. The fitted values (estimated value under the model for the i-th observation) are ŷi = α̂ + β̂xi. 2. The residual values (difference between the observed and fitted value)
are êi = yi − ŷi = yi − α̂ − β̂xi. 3. Define the residual sum of squares by RSS = Σi=1..n êi² = Σi=1..n (Yi − α̂ − β̂xi)². 4. The best predictor of the value Y that would be observed at some x value for which we have no data is ŷ = α̂ + β̂x. Properties of these quantities Lemma 9.11. 1. Substituting the optimal values in Equation (9.1) above, we deduce the extremely useful formula that RSS = ssyy − ssxy²/ssxx. 2. We estimate σ² by σ̂² = RSS/(n − 2). There are only n − 2 independent values of êi (cf n − 1 independent values of xi − x̄ in Definition 1.11). The model is chosen to minimise the sum of squares of residuals. However, systematic patterns in the residuals can indicate lack of fit in the model – see Section 9.7. Note that prediction can be less accurate if x lies outside the range of observed xi (i.e. extrapolating from data rather than interpolating). Section 9.5: Leaning Tower of Pisa Example Example 9.12. Studies by engineers on the Leaning Tower of Pisa between 1975 and 1987 recorded the following data on the tilt of the tower. Each tilt value in the table represents the difference from being vertical. The data are coded in tenths of a millimetre in excess of 2.9 metres, so the 1975 tilt of 642 represents an actual difference of 2.9642 metres. Only the last two digits of the year are shown. The data are contained in the Statistics 1 data sets pisa.year and pisa.tilt respectively. Leaning Tower of Pisa Example Example 9.12. The summary statistics for the data set are: n = 13, Σxi = 1053, Σyi = 9018, Σxi² = 85475, Σyi² = 6271714, Σyixi = 732154. This gives x̄ = 81, ȳ = 693.6923, ssxx = 182, ssxy = 1696 and ssyy = 15996.77.
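These summary statistics feed directly into Theorem 9.9 and Lemma 9.11; a quick sketch of the hand calculation (in Python rather than the course's R, using only the sums given above):

```python
# Sketch (Python rather than the course's R): computing the least squares
# estimates of Theorem 9.9 from the Pisa summary statistics above.
n = 13
sum_x, sum_y = 1053, 9018
sum_x2, sum_y2, sum_xy = 85475, 6271714, 732154

x_bar, y_bar = sum_x / n, sum_y / n
ss_xx = sum_x2 - n * x_bar ** 2
ss_yy = sum_y2 - n * y_bar ** 2
ss_xy = sum_xy - n * x_bar * y_bar

beta_hat = ss_xy / ss_xx        # slope
alpha_hat = y_bar - beta_hat * x_bar  # intercept
rss = ss_yy - ss_xy ** 2 / ss_xx      # Lemma 9.11

print(round(beta_hat, 4), round(alpha_hat, 4))  # 9.3187 -61.1209
print(round(rss / (n - 2), 2))  # the estimate sigma^2-hat = RSS/(n - 2)
```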
Thus the least squares estimates are β̂ = ssxy/ssxx = 9.3187 and α̂ = ȳ − β̂x̄ = −61.1209, giving the fitted regression line y = α̂ + β̂x = −61.1209 + 9.3187x. Leaning Tower of Pisa Example Example 9.12. From this the fitted values and the residuals in the table can be calculated, using the formulas ŷi = α̂ + β̂xi and êi = yi − ŷi for i = 1, . . . , n. A scatter plot of the data is shown on the left on the next page, together with the fitted regression line. There seems to be a good fit of the straight line to the data. A plot of the residuals against the corresponding year is shown on the right. As we'd hope, the residuals appear to be fairly random, with no obvious systematic pattern or systematic trend in variability (see Section 9.7). Leaning Tower of Pisa Example Example 9.12. [Figure: left panel, 'Pisa – scatter plot', pisa.tilt against pisa.year with the fitted regression line; right panel, 'Pisa – residuals', residuals(pisa) against pisa.year.] Section 9.6: Fitting linear regression models using R Example 9.13. R has a simple command for fitting regression models. This command produces an R object, containing numerical outputs which can be accessed by applying appropriate commands. Assume the predictor (x) values are in a data array xdata and the response (y) values are in a data array ydata. We can perform an initial analysis with the commands: > plot(xdata,ydata) > xyoutput <- lm(ydata ~ xdata) > coef(xyoutput) The first line produces an initial scatter plot. The second line tells R to perform a linear regression with the response values in ydata and the predictor values in xdata and to store the output in the object xyoutput. The third line produces a vector containing the least squares estimates α̂ and β̂.
Example 9.13. > plot(xdata,ydata, abline(coef(xyoutput))) will produce a scatter plot together with the fitted regression line – i.e. the line whose intercept is the first value and whose slope is the second value in the vector coef(xyoutput). > fitted(xyoutput) > residuals(xyoutput) will output vectors of fitted values and residual values. Thus, for example, we can plot the residuals against the predictor values with the command: > plot(xdata,residuals(xyoutput)) In Section 10, we will look at other outputs such as summary(xyoutput), which produces (among other things) estimates of σ² and of Var(α̂) and Var(β̂). Leaning Tower of Pisa Example Example 9.14. For the Leaning Tower of Pisa example in Section 9.5, I put the predictor values in the vector pisa.year and the response values in the vector pisa.tilt. I used the commands: > attach(pisa); pisafit <- lm(tilt ~ year) to perform the linear regression analysis and store the output in the object pisafit. I then inspected the scatter plot and the fitted line with the commands: > plot(year,tilt, abline(coef(pisafit))) and inspected the values of the least squares estimates with: > coef(pisafit) which gave output: (Intercept) year -61.120879 9.318681 Example 9.14. Finally I inspected the fitted values and the values of the residuals with the commands: > fitted(pisafit) > residuals(pisafit) I plotted the residuals against the predictor (year) values with the command: > plot(year, residuals(pisafit)) For those who are interested, I used the segments command, specifically > segments(year,0,,residuals(pisafit)); abline(h=0) to add the extra lines – see help(segments).
Section 9.7: Quality of fit of linear regression model One way of assessing the fit of a model is by examining a plot of the residuals ê1, ê2, . . . , ên. These can be plotted against the values x1, x2, . . . , xn or the fitted values ŷ1, ŷ2, . . . , ŷn. If the model in Section 9.2 is correct, then e1, e2, . . . , en is a random sample from a distribution with expectation 0 and variance σ². We cannot observe or calculate e1, e2, . . . , en, but we can look at their estimates ê1, ê2, . . . , ên instead. In linear regression examples, you should always plot the points on a scatter plot, draw in the estimated regression line, and plot the residuals. Diagnostics What we should see: no pattern in the size or sign of the residuals (so the linear model is correct); and, if we assume normally distributed errors (see next chapter), additionally: a roughly symmetric distribution of the residuals about 0; very few extreme outliers (residuals ≥ 3σ̂ or ≤ −3σ̂, say). If what we see departs from this ideal, we may be able to judge from the pattern we can see how to change the model so that it does fit. This information may not be at all apparent just from the summary data values (see Example 9.15 below). We might allow the error variance σ² to depend on x, or we could include a quadratic term in the model, like E(Y|x) = α + βx + γx². This is beyond the scope of this unit. Anscombe's Quartet Example 9.15. The artificial example below, due to Anscombe, brings out this point. It consists of four artificial data sets, each of 11 data pairs, with the same values of the relevant summary statistics.
Thus each data set gives rise to exactly the same regression line and exactly the same inferences for α, β and σ². The data are contained in the Statistics 1 data set anscombe.
The summary statistics for each data set are (approximately):
n = 11, Σxi = 99, Σyi = 82.5, Σxi² = 1001, Σyi² = 660, Σxiyi = 797.5.

Anscombe's Quartet
Example 9.15.
[Figure: scatter plots of the four Anscombe data sets, (x1, y1) through (x4, y4), each shown with the common fitted regression line.]

Anscombe's Quartet
Example 9.15.
From the scatter plots with the fitted regression lines, we see immediately that there is a lack of fit for data sets 2, 3 and 4:
- in data set 2 the relationship between x and y is quadratic rather than linear, so the simple linear model is incorrect;
- in data set 3 the simple linear regression model is correct, but a very clear regression line is distorted by the effect of a single outlier;
- in data set 4, the regression line is particularly sensitive to the y value for the single observation taken at x = 19, and it is impossible to tell from this choice of x values whether or not a simple linear regression model is suitable.

Section 10: Linear Regression: Confidence Intervals & Hypothesis Tests
Aims of this section:
We continue with the simple linear regression model, under the extra assumption of normality. We use confidence intervals and hypothesis tests to investigate the pattern of variation of the mean value of a response variable Y with the corresponding value of a quantitative predictor variable x.
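To see what this model means in practice, one can simulate data from it and re-fit by least squares. A hypothetical Python simulation sketch; the parameter values α = 3, β = 0.5, σ = 1 and the seed are arbitrary illustrative choices, not course data:

```python
import random

# Simulate Yi = alpha + beta*xi + ei with ei ~ N(0, sigma^2) at fixed
# predictor values, then recover the slope and intercept by least squares.
random.seed(1)
alpha, beta, sigma = 3.0, 0.5, 1.0       # arbitrary "true" parameter values
xdata = [float(i) for i in range(1, 51)]
ydata = [alpha + beta * x + random.gauss(0.0, sigma) for x in xdata]

n = len(xdata)
xbar = sum(xdata) / n
ybar = sum(ydata) / n
ssxx = sum((x - xbar) ** 2 for x in xdata)
ssxy = sum((x - xbar) * (y - ybar) for x, y in zip(xdata, ydata))
beta_hat = ssxy / ssxx
alpha_hat = ybar - beta_hat * xbar

print(alpha_hat, beta_hat)   # close to the true values 3.0 and 0.5
```

Repeating this with different seeds illustrates the sampling variability of α̂ and β̂ that the coming sections quantify.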
Suggested reading: Rice, Sections 14.1–14.2.

Objectives: by the end of this section you should be able to
- State the model assumptions for the simple Normal linear regression model.
- Derive the mean, variance and distribution of the estimators of the slope and intercept of the fitted regression line, and calculate the corresponding standard errors.
- Construct exact confidence intervals for the values of both the slope and the intercept.
- Perform standard hypothesis tests on the values of both the slope and the intercept.
- Use the summary() command in R to calculate confidence intervals and perform hypothesis tests on the values of both the slope and the intercept.

Section 10.1: Simple normal linear regression
We make an extra assumption on the distribution of the errors ei. This lets us perform hypothesis tests and find confidence intervals.
Definition 10.1.
Let x1, ..., xn be the n given values for variable x. For i = 1, ..., n, assume that the value yi of the response variable is an observed value of the random variable Yi. Further assume that
1. Yi = α + βxi + ei, where
2. the ei are independent identically distributed (IID) N(0, σ²), and
3. α, β and σ² are unknown parameters.

Checking assumption of normality
Remark 10.2.
Note that for given x1, ..., xn, the following are equivalent:
(i) the ei are IID normal N(0, σ²);
(ii) E(ei) = 0, Var(ei) = σ², the ei independent, the ei normal;
(iii) E(Yi) = α + βxi, Var(Yi) = σ², the Yi independent, the Yi normal.
Note that we cannot check the assumption that Yi ∼ N(α + βxi, σ²) from the data by simply making a histogram, stem-and-leaf plot, or QQ plot of the data y1, y2, ..., yn, since the observations all have different normal distributions (their means depend on the xi).
But we can carry out a check after the linear regression has been fitted, by looking at the residuals. Continuing the example in Section 9.5, typing
> qqnorm(residuals(xyoutput))
shows a Normal Q-Q plot of the residuals and helps check for non-Normality.

Section 10.2: Properties of α̂, β̂ and σ̂²
If we took repeated independent samples of the Yi, keeping the predictor values fixed at x1, ..., xn, the values of α̂ and β̂ would vary from sample to sample as the values of y1, ..., yn vary.

Theorem 10.3.
If the {ei} are normally distributed (as in Definition 10.1) then:
(i) β̂ ∼ N(β, σ²/ssxx);
(ii) α̂ ∼ N(α, σ²(1/n + x̄²/ssxx)) = N(α, σ² Σxi²/(n ssxx)).

Remark 10.4.
1. In fact, these values of the mean and variance hold without assuming the ei are normal.
2. Under the assumption of normality, the least squares estimates (α̂, β̂) are also the maximum likelihood estimates (so we have two good reasons to think they will be reasonable estimates).

Proof of Theorem 10.3(i)
Proof.
Note that, since Σ(xi − x̄) = 0, we have
ssxy = Σ(xi − x̄)(yi − ȳ) = Σ(xi − x̄)yi
(all sums running over i = 1, ..., n). Thus, considered as a random variable,
β̂ = ssxy/ssxx = Σ [(xi − x̄)/ssxx] Yi = Σ bi Yi.
Here, for the given fixed values of x1, ..., xn, the bi = (xi − x̄)/ssxx, i = 1, ..., n, are fixed constants and the Yi are independent N(α + βxi, σ²). From Lemma 5.8 we can immediately deduce that β̂ has a normal distribution, since it is a linear combination of independent normals.

Proof (continued).
The constants bi satisfy:
Σ bi = Σ(xi − x̄)/ssxx = 0/ssxx = 0,
Σ bi xi = Σ(xi − x̄)xi/ssxx = (Σxi² − x̄ Σxi)/ssxx = ssxx/ssxx = 1,
Σ bi² = Σ(xi − x̄)²/(ssxx)² = ssxx/(ssxx)² = 1/ssxx.
To calculate the mean and variance of β̂:
E(β̂) = E(Σ bi Yi) = Σ bi (α + βxi) = α Σ bi + β Σ bi xi = β.
Since the Yi are independent,
Var(β̂) = Var(Σ bi Yi) = Σ bi² Var(Yi) = σ² Σ bi² = σ²/ssxx.

Proof of (ii)
Proof.
The derivation of the distribution of α̂ is very similar to that for β̂. We start by noting that
α̂ = Ȳ − β̂x̄ = Σ Yi/n − x̄ Σ bi Yi = Σ (1/n − bi x̄) Yi = Σ ai Yi,
where ai = 1/n − bi x̄, i = 1, ..., n. This in turn has a Normal distribution, with
E(α̂) = α Σ ai + β Σ ai xi and Var(α̂) = σ² Σ ai².
Finally, using the facts that Σ bi = 0, Σ bi xi = 1 and Σ bi² = 1/ssxx, one can easily deduce that
Σ ai = 1, Σ ai xi = 0 and Σ ai² = 1/n + x̄²/ssxx = Σxi²/(n ssxx).
This means that E(α̂) = α and Var(α̂) = σ²(1/n + x̄²/ssxx).

t-distributions for α̂ and β̂
Although σ² is unknown, we can deal with it in the usual way, using the t-distribution. We combine Lemma 10.5 below and Theorem 10.3.

Lemma 10.5.
(n − 2)σ̂²/σ² ∼ χ²n−2.
Key fact: this holds independently of α̂ and β̂.
It means that (since a χ²r random variable has mean r and variance 2r – see Remark 5.12):
E(σ̂²) = σ² and Var(σ̂²) = 2σ⁴/(n − 2).
Proof.
Not proved here.

Form of t-distributions
Theorem 10.6.
Write sα̂ = σ̂ √(1/n + x̄²/ssxx) for the standard error for α̂; sα̂ is an estimate of the standard deviation of α̂. Similarly, write sβ̂ = σ̂/√ssxx for the standard error for β̂.
If the assumptions of Definition 10.1 hold, then
(α̂ − α)/sα̂ ∼ tn−2 and (β̂ − β)/sβ̂ ∼ tn−2.

Distribution of α̂
Proof.
Theorem 10.3 gives that
(α̂ − α)/(σ √(1/n + x̄²/ssxx)) ∼ N(0, 1).
Lemma 10.5 gives that (independently)
σ̂²/σ² ∼ χ²n−2/(n − 2).
Hence, using Definition 5.16, we know that
(α̂ − α)/sα̂ = [(α̂ − α)/(σ √(1/n + x̄²/ssxx))] × 1/√(σ̂²/σ²) ∼ N(0, 1)/√(χ²n−2/(n − 2)) ∼ tn−2,
as required.

Distribution of β̂
Proof.
Theorem 10.3 gives that
(β̂ − β)/(σ/√ssxx) ∼ N(0, 1).
Lemma 10.5 gives that (independently)
σ̂²/σ² ∼ χ²n−2/(n − 2).
Hence, using Definition 5.16, we know that
(β̂ − β)/sβ̂ = [(β̂ − β)/(σ/√ssxx)] × 1/√(σ̂²/σ²) ∼ N(0, 1)/√(χ²n−2/(n − 2)) ∼ tn−2,
as required.

Section 10.4: Confidence intervals for α and β
Example 10.7.
Theorem 10.6 shows that if the assumptions of Definition 10.1 hold then (α̂ − α)/sα̂ ∼ tn−2. Hence we can obtain a 100(1 − γ)% confidence interval for α:
P(−tn−2;γ/2 ≤ (α̂ − α)/sα̂ ≤ tn−2;γ/2) = 1 − γ.
We can make α the subject of this in the usual way, to obtain
P(α̂ − tn−2;γ/2 sα̂ ≤ α ≤ α̂ + tn−2;γ/2 sα̂) = 1 − γ.
Hence, under the assumptions of Definition 10.1, taking
(cL(α), cU(α)) = (α̂ − tn−2;γ/2 sα̂, α̂ + tn−2;γ/2 sα̂),
(cL(β), cU(β)) = (β̂ − tn−2;γ/2 sβ̂, β̂ + tn−2;γ/2 sβ̂)
gives a 100(1 − γ)% confidence interval for α and for β respectively.

Example – the Leaning Tower of Pisa
Example 10.8.
We previously met this example in Example 9.12. Some of the basic arithmetic:
n = 13, Σxi = 1053, Σyi = 9018, Σxi² = 85475, Σyi² = 6271714, Σxiyi = 732154.
So x̄ = 81, ȳ = 693.6923, ssxx = 182, ssyy = 15996.77, ssxy = 1696.
Then β̂ = 9.319, α̂ = −61.121, σ̂² = 17.481, s²α̂ = 631.51, s²β̂ = 0.096047.
Finally, taking square roots, sα̂ = 25.130, sβ̂ = 0.3099.
We make the standard assumptions of Definition 10.1, so Theorem 10.6 implies that (α̂ − α)/sα̂ ∼ tn−2 = t11.

Example – the Leaning Tower of Pisa
Example 10.9.
Example 10.7 shows that a 100(1 − γ)% confidence interval for α is given by
(cL, cU) = (α̂ − tn−2;γ/2 sα̂, α̂ + tn−2;γ/2 sα̂).
For example, if we want a 90% confidence interval, then γ = 0.1 and t11;0.05 = 1.796, so
(cL, cU) = (−61.121 − 1.796 × 25.130, −61.121 + 1.796 × 25.130) = (−106.25, −15.99).
Similarly t11;0.025 = 2.201, so a 95% confidence interval for β is
(cL, cU) = (β̂ − tn−2;γ/2 sβ̂, β̂ + tn−2;γ/2 sβ̂)
= (9.319 − 2.201 × 0.3099, 9.319 + 2.201 × 0.3099) = (8.64, 10.00).

Section 10.5: Hypothesis tests for β
As in Definition 10.1, assume that the ei are IID N(0, σ²). The model Yi = α + βxi + ei would be simpler if β = 0. In particular, we might want to know if the expected value of the response varies systematically with the predictors. If not, then we can simplify the model by removing β. We can place all of this in our standard hypothesis testing framework.
Remark 10.10.
We can use a similar argument, based on the fact that (α̂ − α)/sα̂ ∼ tn−2, to test whether α = 0. However, this corresponds to the regression line passing through the origin, which is often not an interesting hypothesis.

Hypothesis tests for β
Example 10.11.
Model: we make the model assumptions of Definition 10.1.
Hypotheses: we test H0 : β = 0 vs H1 : β ≠ 0.
Test statistic: Theorem 10.6 shows that (β̂ − β)/sβ̂ ∼ tn−2. This suggests the use of T = β̂/sβ̂. When H0 is true (i.e. β = 0), then T ∼ tn−2.
p-value: since we consider a two-sided alternative, with tobs = β̂/sβ̂,
p-value = P(|T| ≥ |tobs| | H0 true) = P(|tn−2| ≥ |tobs|) = 2(1 − P(tn−2 ≤ |tobs|)).
Critical value: the critical region for a γ-level test is C = {|T| ≥ c*}.
Here c* is defined by c* = tn−2;γ/2, since
γ = P(reject H0 | H0 true) = P(|T| ≥ c* | H0 true) = P(|tn−2| ≥ c*) = 2P(tn−2 ≥ c*).

Example – the Leaning Tower of Pisa
Example 10.12.
We return to the setting of Examples 9.12 and 10.8. We wish to test H0 : β = 0 vs H1 : β ≠ 0 (does the mean tilt vary with year?).
As in Example 10.11, we use T = β̂/sβ̂ ∼ tn−2.
In this case, tobs = 9.319/0.3099 = 30.071, to compare with a t11 distribution.
p-value: given by 2(1 − P(t11 ≤ 30.071)). R gives pt(30.071,11) = 1 to its displayed accuracy, so the p-value is essentially zero. In fact, using the lm() command gives a p-value of 6 × 10⁻¹².
Critical region: similarly, for a test with significance level γ = 0.05, we have tn−2;γ/2 = t11;0.025 = 2.201. Hence C = {|T| ≥ 2.201}.
The p-value is small, and tobs lies well inside the critical region. This is very strong evidence that H0 is not true – that the mean tilt does vary with year. This is consistent with the fact that 0 ∉ (cL, cU) in Example 10.9.

Section 10.6: Confidence Intervals and Hypothesis Tests using the summary command in R
Consider the simple Normal linear regression model Yi = α + βxi + ei, where the ei are IID N(0, σ²). Assume the values x1, ..., xn are contained in an R data vector called xdata and the values y1, ..., yn are contained in an R data vector called ydata. We have already seen how to produce the output using the R command
xyoutput <- lm(ydata ~ xdata)
We can perform exploratory data analysis, estimation and assessment of fit using the follow-up commands plot, coef, fitted, and residuals.

Summary command in R
For confidence intervals and hypothesis tests, most of the necessary information can be obtained with the summary command.
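Every number that summary() reports in its Coefficients table can be reproduced by hand from the summary statistics. As a cross-check (not R itself), here is a Python sketch using the sums quoted in Example 10.8 for the Pisa data:

```python
import math

# Reproduce the Coefficients table entries for the Pisa data from the
# summary statistics of Example 10.8.
n = 13
sum_x, sum_y = 1053.0, 9018.0
sum_xx, sum_yy, sum_xy = 85475.0, 6271714.0, 732154.0

xbar, ybar = sum_x / n, sum_y / n
ssxx = sum_xx - n * xbar ** 2
ssyy = sum_yy - n * ybar ** 2
ssxy = sum_xy - n * xbar * ybar

beta_hat = ssxy / ssxx                            # Estimate, slope row
alpha_hat = ybar - beta_hat * xbar                # Estimate, (Intercept) row
sigma2_hat = (ssyy - ssxy ** 2 / ssxx) / (n - 2)  # residual variance estimate

se_alpha = math.sqrt(sigma2_hat * (1 / n + xbar ** 2 / ssxx))  # Std. Error, intercept
se_beta = math.sqrt(sigma2_hat / ssxx)                         # Std. Error, slope
t_alpha = alpha_hat / se_alpha                    # t value, intercept
t_beta = beta_hat / se_beta                       # t value, slope

print(round(alpha_hat, 4), round(se_alpha, 4), round(t_alpha, 3))
print(round(beta_hat, 4), round(se_beta, 4), round(t_beta, 3))
```

These values match the estimates, standard errors and t statistics quoted in Examples 10.8 and 10.12.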
For example, summary(xyoutput) produces the output sketched below, where the formula shown in each box is replaced in the actual output by its numerical value. The output:
(i) reproduces the formula used to produce the output, to show exactly which model is being analysed;
(ii) produces a summary of the residual values (or lists in full the numerical values of the residuals if there are only a few of them);
(iii) lists the relevant values for confidence intervals and hypothesis tests – first for α (the intercept in the model) and then for β (the coefficient of xdata in the model);
(iv) lists numerical values relevant to estimating σ² (or, more precisely, σ);
(v) gives information on the R² and F-statistic values (not covered in this unit).

Call:
lm(formula = ydata ~ xdata)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
             Estimate           Std. Error                     t value    Pr(>|t|)
(Intercept)  [α̂ = ȳ − β̂x̄]   [sα̂ = σ̂ √(1/n + x̄²/ssxx)]   [α̂/sα̂]   [2(1 − Ftn−2(|α̂/sα̂|))]
xdata        [β̂ = ssxy/ssxx]  [sβ̂ = σ̂/√ssxx]               [β̂/sβ̂]   [2(1 − Ftn−2(|β̂/sβ̂|))]
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: [σ̂ = √((ssyy − ssxy²/ssxx)/(n − 2))] on n − 2 degrees of freedom

(Here Ftn−2 denotes the distribution function of the tn−2 distribution.)

In particular, the values on the line beginning (Intercept) are:
(i) α̂ (the estimate of α),
(ii) sα̂ (the standard error, which estimates the standard deviation of α̂),
(iii) tobs = α̂/sα̂ (the observed test statistic for testing H0 : α = 0 vs. H1 : α ≠ 0),
(iv) P(|W| > |tobs|), where W ∼ tn−2 (the p-value of the data for the test).
The result of a hypothesis test of H0 : α = 0 vs. H1 : α ≠ 0 can then be deduced immediately from the corresponding p-value. Moreover, the endpoints for a 100(1 − γ)% confidence interval for α can be calculated using the values of α̂, sα̂ and the appropriate t-distribution percentage point tn−2;γ/2.
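That confidence-interval calculation is plain arithmetic once the t quantile is known. A Python check using the Pisa values α̂ = −61.121 and sα̂ = 25.130, with the table value t11;0.05 = 1.796 quoted in Example 10.9:

```python
# 90% confidence interval for alpha in the Pisa example:
# (alpha_hat - t * se, alpha_hat + t * se), with the t quantile taken
# from tables as in the notes (t_{11; 0.05} = 1.796).
alpha_hat, se_alpha = -61.121, 25.130
t_quantile = 1.796   # t_{n-2; gamma/2} for n = 13, gamma = 0.10

c_lower = alpha_hat - t_quantile * se_alpha
c_upper = alpha_hat + t_quantile * se_alpha
print(round(c_lower, 2), round(c_upper, 2))  # about -106.25 and -15.99
```

Replacing 1.796 by t11;0.025 = 2.201 and (α̂, sα̂) by (β̂, sβ̂) gives the 95% interval for β in the same way.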
The values on the line beginning xdata are the corresponding quantities for estimating, constructing confidence intervals or performing hypothesis tests on β:
(i) β̂, (ii) sβ̂, (iii) tobs = β̂/sβ̂, and (iv) P(|W| > |tobs|), where W ∼ tn−2.
A 100(1 − γ)% confidence interval for β can be obtained in a similar manner to that for α.

Section 10.7: The Leaning Tower of Pisa example in R
Example 10.13.
In our previous analysis of these data we had typed pisa<-lm(pisa.tilt~pisa.year) to carry out the linear regression. Applying the summary(pisa) command to this previous result produces the output below. You can (and should) check that the values shown correspond to the appropriate values calculated in your notes when we constructed confidence intervals and performed hypothesis tests on α and β.
From the output we can, for example, immediately read off the least squares estimate β̂ = 9.3187 and its standard error sβ̂ = 0.3099. We can also see that the p-value for testing H0 : β = 0 versus H1 : β ≠ 0 is extremely small (6.5 × 10⁻¹²), and so there is very strong evidence that β is not zero and the mean tilt does vary significantly with the year.

Example 10.13.
Call:
lm(formula = pisa.tilt ~ pisa.year)

Residuals:
    Min      1Q  Median      3Q     Max
-5.9670 -3.0989  0.6703  2.3077  7.3956

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -61.1209    25.1298  -2.432   0.0333 *
pisa.year     9.3187     0.3099  30.069  6.5e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.181 on 11 degrees of freedom
Multiple R-Squared: 0.988, Adjusted R-squared: 0.9869
F-statistic: 904.1 on 1 and 11 DF, p-value: 6.503e-12

Parametric Families Summary Sheet
Each entry lists: parameter values; pmf or pdf; X values; mean; variance.
- Bernoulli(θ): 0 < θ < 1; pX(x; θ) = θ^x (1 − θ)^(1−x); x = 0, 1; E(X; θ) = θ; Var(X; θ) = θ(1 − θ).
- Binomial(K, θ) (K known): 0 < θ < 1; pX(x; θ) = (K choose x) θ^x (1 − θ)^(K−x); x = 0, 1, ..., K; E(X; θ) = Kθ; Var(X; θ) = Kθ(1 − θ).
- Geometric(θ): 0 < θ < 1; pX(x; θ) = θ(1 − θ)^(x−1); x = 1, 2, ...; E(X; θ) = 1/θ; Var(X; θ) = (1 − θ)/θ².
- Poisson(θ): 0 < θ < ∞; pX(x; θ) = e^(−θ) θ^x/x!; x = 0, 1, 2, ...; E(X; θ) = θ; Var(X; θ) = θ.
- Uniform(0, θ): θ > 0; fX(x; θ) = 1/θ; 0 < x < θ; E(X; θ) = θ/2; Var(X; θ) = θ²/12.
- Exponential(θ): θ > 0; fX(x; θ) = θ e^(−θx); x > 0; E(X; θ) = 1/θ; Var(X; θ) = 1/θ².
- Gamma(α, λ): α > 0, λ > 0; fX(x; α, λ) = λ^α x^(α−1) e^(−λx)/Γ(α); x > 0; E(X; α, λ) = α/λ; Var(X; α, λ) = α/λ².
- Normal(μ, σ²): −∞ < μ < ∞, σ² > 0; fX(x; μ, σ²) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)); −∞ < x < ∞; E(X; μ, σ²) = μ; Var(X; μ, σ²) = σ².

[Figures: examples of pdfs in the Exponential(θ), Uniform(0, θ), Gamma(α, λ) and Normal(μ, σ²) families, plotted for various parameter values.]
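Several of the mean and variance formulas in the summary sheet can be sanity-checked numerically. A Python sketch using truncated series and a midpoint Riemann sum; the parameter values (θ = 0.3, λ = 2, θ = 4) are arbitrary illustrative choices:

```python
import math

# Geometric(theta): mean should be 1/theta, via a truncated series.
theta = 0.3
geo_mean = sum(x * theta * (1 - theta) ** (x - 1) for x in range(1, 2000))

# Poisson(lambda): mean should be lambda, via a truncated series.
lam = 2.0
pois_mean = sum(x * math.exp(-lam) * lam ** x / math.factorial(x)
                for x in range(0, 100))

# Uniform(0, theta): mean theta/2 and variance theta^2/12,
# via a midpoint Riemann sum of the defining integrals.
th = 4.0
steps = 100_000
dx = th / steps
unif_mean = sum((i + 0.5) * dx * (1 / th) * dx for i in range(steps))
unif_var = sum(((i + 0.5) * dx - th / 2) ** 2 * (1 / th) * dx
               for i in range(steps))

print(geo_mean, pois_mean, unif_mean, unif_var)
```

The truncation and discretisation errors here are far below the tolerances one would care about, so the numerical values agree with the tabulated formulas 1/θ, λ, θ/2 and θ²/12.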