MATH11400 Statistics 1 2016-17
Homepage https://people.maths.bris.ac.uk/~maotj/teaching.html

Problem Sheet 1: Basics of data analysis

Questions marked with a star are to be handed in and marked by your tutor.

Remember: when online, you can access the Statistics 1 data sets from R (3.2.3 or later) by typing

    load(url("https://people.maths.bris.ac.uk/~maotj/teach/stats1.RData"))

*1. Ensure that you can run R – you could download R (for free) onto your own computer, log in remotely to the Student Desktop, or run R directly in the undergraduate Computer Lab. Running R is a central part of this course, and you do need to be able to do it.

*2. In an experiment to investigate the heat of sublimation of iridium, the following 27 measurements were made, listed across the rows in the order they were taken. The data is contained in the Statistics 1 data set iridium.

    136.6 145.2 151.5 162.7 159.1 159.8 160.8 173.9 160.1
    160.4 161.1 160.6 160.2 159.5 160.3 159.2 159.3 159.6
    160.0 160.2 160.1 160.0 159.7 159.5 159.5 159.6 159.5

(a) Use the R commands stem, hist, boxplot, and plot to make a stem-and-leaf plot, a histogram, a boxplot and a plot of the observations in the order they were taken. Print your plots and comment on the overall pattern and any striking features.
(b) Use the R commands median and mean to find the median and the mean. Use ?mean in R to see how to compute a trimmed mean. Compute the 10% and 20% trimmed means for the iridium data set. Compare how well the mean, the median and the trimmed means represent the centre of this data set.
(c) Use the R commands var and sd to find the sample variance and standard deviation, use the R commands fivenum and summary to find the hinges and the sample quartiles, and use the R command IQR to find the interquartile range (but see the comments on ‘Hinges and Quartiles’ at the end of this sheet). Again, compare how these values represent the spread of the data.
(d) What conclusions do you draw from your plots and numerical summaries?
What effect do the outliers have on the numerical and graphical summaries? What would the corresponding results look like if the outliers were removed?

3. Make a new version of the iridium data set, excluding the apparent outliers, by typing

    ir2 <- iridium[-c(1,2,3,4,8)]

Create a histogram and a stem-and-leaf plot of this new data set. Now make similar plots for an artificial sample made by generating the same number (22) of observations from a normal distribution (e.g. data <- rnorm(22)). Visually compare the plots for the real data and the artificial data.

4. Construct an R function for calculating an empirical distribution function by typing in the following instructions (note that the R prompt will switch from > to + while it is waiting for the command to be completed):

    plot.edf <- function(x){
      n <- length(x)
      plot(sort(x), (1:n)/(n+1), type='s', xlab='data', ylab='empirical cdf')
    }

Having loaded the Statistics 1 data sets, produce an empirical distribution function (edf) plot of the iridium data by typing the command plot.edf(iridium) and comment on how the shape of the edf relates to the data.

*5. Having loaded the Statistics 1 data set into R, use stem(us.temp,scale=4) to produce a stem-and-leaf plot of the dataset us.temp, which gives the mean January temperatures for 60 U.S. metropolitan areas. Use R to give a five number summary of the data. By examining the data set directly, comment on any unusual pattern in the data and try to find a plausible explanation.

6. (a) (From a recent Guardian puzzle) A lazy flea is wandering along a ruler. He knows that at a certain time, he will receive an instruction to move to the 1 inch mark on the ruler, the 2 inch mark or the 11 inch mark. Which of these it will be is uncertain, and he can assume there is 1/3 probability of each of these possibilities. Where should he position himself to minimise the distance he has to move when instructed?
(b) How does your answer change if instead he wants to minimise the mean squared distance he has to move?
(c) What if he wants to minimise the maximum distance he might have to move?
(d) What does this question have to do with the issue of giving a numerical summary of the centre of a sample of data $x_1, x_2, \ldots, x_n$?

*7. Let $\{x_1, \ldots, x_n\}$ be a data set of real numbers and let $y_i = a x_i + b$, for $i = 1, \ldots, n$, for some $a \neq 0$.
(a) Let $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$. Show that $\bar{y} = a\bar{x} + b$ and $s_y^2 = a^2 s_x^2$.
(b) Find expressions for the median, interquartile range and trimmed mean of $\{y_i\}$ in terms of those of $\{x_i\}$. (Note: why do you need to consider the cases $a > 0$ and $a < 0$ separately?)
(c) Let $x$ denote temperature in degrees centigrade and let $y$ denote temperature in degrees Fahrenheit, so $y = 1.8x + 32$. Assume the $\{x_i\}$ data set has mean 68.1, median 68.9, variance 3.2 and IQR 7.7. Calculate the corresponding quantities for the $\{y_i\}$ data.

8. Boxplots are most useful for comparing more than one sample. The built-in data set InsectSprays in R gives the number of insects found on plants subjected to 6 different treatments labelled A-F. Type the following in R:

    data(InsectSprays)
    help(InsectSprays)
    InsectSprays
    boxplot(count ~ spray, data = InsectSprays)

The help command gives some background information about the data, and the command InsectSprays on its own prints out the data. For this data set, the boxplot command produces a separate boxplot (on common axes) for each of the treatments. Use this plot to compare the different treatments. Calculate the mean and variance for each of the treatment types and see if you come to the same conclusions. (It is good practice to work out how to do this in R.)

Hinges and Quartiles

The lower hinge is ‘the median of the set of values ≤ the sample median’ and the upper hinge is ‘the median of the set of values ≥ the sample median’.
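The two hinge definitions just quoted translate almost word for word into R. The following is a minimal sketch, not part of the course materials: the helper name hinges is hypothetical, and the ten-value sample is made up (and chosen to have no ties) purely for illustration.

```r
# Hinges computed directly from the definition: the medians of the values
# at or below, and at or above, the sample median.
hinges <- function(x) {          # 'hinges' is a hypothetical helper name
  m <- median(x)
  c(lower = median(x[x <= m]), upper = median(x[x >= m]))
}

x <- c(2, 4, 5, 7, 9, 10, 12, 15, 18, 21)  # made-up sample with no ties

hinges(x)                   # hinges from the definition
fivenum(x)[c(2, 4)]         # hinges as reported by R's fivenum
quantile(x, c(0.25, 0.75))  # R's default sample quartiles, for comparison
```

On this sample the first two results agree, while the default values returned by quantile need not coincide with the hinges. With tied values at the median the simple definition-based helper may behave differently from fivenum, so treat it as an illustration rather than a replacement.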
Hinges were introduced by Tukey as a simple alternative to quartiles, since sources disagreed on how quartiles should be defined. Loosely speaking, ‘quartiles’ are values that divide a dataset into four equal parts – a quarter of the data values are greater than the upper quartile, a quarter are between the upper quartile and the median, a quarter are between the median and the lower quartile, and a quarter are less than the lower quartile.

Given a dataset with $n$ data values $x_1, x_2, \ldots, x_{n-1}, x_n$, denote the ordered values (the order statistics) by $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n-1)} \le x_{(n)}$. This suggests $Q_1$ should be roughly the $(n/4)$th observation (ordered in increasing size). If $n/4$ is not an integer, then $Q_1$ should lie between $x_{(i)}$ and $x_{(i+1)}$ where $i = [n/4]$ (the integer part of $n/4$).

The methods actually used to compute, say, $Q_1$ are more complicated than this, but most have the following common basis: set $r = (n + 1 - 2a)/4 + a$, set $i = [r]$ (the integer part of $r$) and set $\gamma = r - i$. Then the required value is $Q_1 = (1 - \gamma) x_{(i)} + \gamma x_{(i+1)}$, i.e. the value that lies $\gamma$ of the way between $x_{(i)}$ and $x_{(i+1)}$. Similarly for $Q_3$, set $s = 3(n + 1 - 2a)/4 + a$, set $j = [s]$ and set $\gamma = s - j$. Then the required value is $Q_3 = (1 - \gamma) x_{(j)} + \gamma x_{(j+1)}$.

Where methods differ is in the value of $a$ – some have used $a = 0$ (Minitab), some $a = 1$ (Excel), some $a = 1/2$ (S-plus). Rice suggests using $a = 0$ or $a = 1/2$. R uses $a = 1/2$ in some places and $a = 3/8$ in other places. The differences between using different values of $a$, or indeed the differences between the hinges and the quartiles, have no real practical importance in terms of interpretation, and they are negligible numerically in larger data sets.

Problem Sheet 2: Parametric families and method of moments
*1. Let $X_1, \ldots, X_n$ be a random sample from a Geom($\theta$) distribution, where $\theta$ is a single unknown parameter. State the value of $E(X; \theta)$ and hence find the method of moments estimator of the unknown parameter $\theta$. Hence given the sample {3, 2, 1, 6, 2, 4, 1}, assumed to come from this distribution, give the estimate $\hat{\theta}_{\text{mom}}$.

2. Let $X_1, \ldots, X_n$ be a random sample from a $N(0, \theta^2)$ distribution, where $\theta > 0$ is a single unknown parameter. Find the method of moments estimator of the unknown parameter $\theta$.

3. The data {2.08, 2.81, 0.04, 1.54, 1.27, 0.74} are thought to come from a Uniform(0,3) distribution. Calculate the corresponding expected quantiles of the Uniform(0,3) distribution. Use R to plot the sample quantiles against these expected quantiles, and comment on the fit of the distribution to the data.

*4. The data in the disasters data set relates to all British coal mining disasters between March 1851 and March 1962 in which 10 or more men were killed. We will study the 120 gaps in days between successive disasters from the start of the series up to January 1889, which we can extract as follows:

    source("https://people.maths.bris.ac.uk/~maotj/teach/disasters.R")
    gaps <- disasters$gap[2:121]

(Beware: the ~ symbol may not cut and paste from pdf into R. If you are not going to be online when using R, you first have to copy the file from the link on the website and save it on your computer, then use the Source R code item on the File menu in R to navigate to the saved file and load it in.)

(a) Use R to plot a histogram of the gaps data. Does the histogram indicate that an exponential distribution would not be an appropriate model for this set of data?
Note that if the occurrence of disasters was completely at random, the gaps would have the ‘lack of memory’ property, with an exponential distribution.
(b) Assuming an Exponential distribution with parameter $\theta$ is an appropriate model, show that the method of moments estimate of $\theta$ for this data set is $\hat{\theta} = 0.008681$.
(c) You are given that the distribution function $F_X(x; \theta)$ has inverse $F_X^{-1}(y; \theta) = -\log(1 - y)/\theta$. Use R to produce a quantile plot of the sample quantiles (the ordered observations) against the corresponding approximate expected quantiles for the fitted Exponential distribution and comment on how well the estimated Exponential distribution fits the data.

*5. In an experiment to investigate the effect of seeding clouds, the rainfall measurements below were recorded for 25 seeded clouds. The data are contained in the Statistics 1 data set seeded.rain.

    4.1 119.0 430.0 7.7 129.6 489.1 17.5 198.6 703.4 31.4 32.7 40.6 92.4
    200.7 242.5 255.0 274.7 978.0 1656.0 1697.8 2745.6 115.3 302.8 118.3 334.1

It is thought that a distribution in the family Gamma($\alpha$, $\lambda$) may provide a good model for the data. Write down the two equations which determine the method of moments estimates of the two unknown parameters $\alpha$ and $\lambda$. Solve these equations to find explicit expressions for $\hat{\alpha}$ and $\hat{\lambda}$, each in terms of the sample moments $m_1$ and $m_2$.

6. The following data come from a series of experiments by Henry Cavendish in 1798, designed to measure the density of the earth, as a multiple of the density of water. The data are contained in the Statistics 1 data set cavendish.

    5.50 5.61 4.88 5.07 5.26 5.55 5.36 5.29 5.58 5.65 5.57 5.53 5.62 5.29 5.44
    5.34 5.79 5.10 5.27 5.39 5.42 5.47 5.63 5.34 5.46 5.30 5.75 5.68 5.85

(a) Use exploratory methods (histogram, boxplot etc.) to see if there is any immediate reason to believe a $N(\mu, \sigma^2)$ distribution would not provide an appropriate model for these data.
(b) Calculate the method of moments estimates of $\mu$ and $\sigma^2$.
Use the estimates to produce a plot in R of the sample quantiles against the corresponding approximate expected quantiles for the fitted Normal distribution. Comment on how well the estimated Normal distribution fits the data.
(c) You can produce an automatic plot of the quantiles of a data set against the corresponding quantiles of a standard $N(0, 1)$ distribution, and, if desired, a fitted line through the first and third quartiles, with the commands

    > qqnorm(dataset)
    > qqline(dataset)

where dataset is the name of the data file. Comment on the similarities and differences between this plot and the plot in part (b) above.

Problem Sheet 3: Likelihood and Maximum Likelihood Estimation

1. Find an expression for the maximum likelihood estimate of $\theta$ in terms of the observed values $x_1, \ldots, x_n$ of a random sample of size $n$ from a Poisson($\theta$) distribution where $\theta > 0$ is an unknown parameter.

*2. Find an expression for the maximum likelihood estimate of $\theta$ in terms of the observed values $x_1, \ldots, x_n$ of a random sample of size $n$ from a continuous distribution with probability density function

$$f_X(x; \theta) = \begin{cases} \theta x^{\theta - 1} & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

where $\theta > 0$ is an unknown parameter. In a random sample of size $n = 5$ from this distribution, the observed values of $x_1, x_2, \ldots, x_5$ were 0.07, 0.29, 0.78, 0.61 and 0.92 respectively. Compute the value of the maximum likelihood estimate of $\theta$ for this set of observations.

3. (a) Find an expression for the maximum likelihood estimate of $\theta$ in terms of the observed values $x_1, \ldots, x_n$ of a random sample of size $n$ from a Binomial($K$, $\theta$) distribution where $K$ is known and where $\theta$ is an unknown parameter such that $0 < \theta < 1$.
(b) Passing a course is based on an exam with 10 multiple choice questions, and a pass mark of 9. In seven mock exams a student has obtained scores of 9, 8, 10, 5, 8, 10 and 7. Assuming that she has independent probability $\theta$ of correctly answering each question in the exam, and assuming these seven scores are a random sample of size seven from a Binomial(10, $\theta$) distribution with the same value of $\theta$, find the maximum likelihood estimate of the probability that the student will pass the examination.

*4. Consider the following data, recording the failure time, in hours, for a batch of 25 lamps. The data are contained in the Statistics 1 data set lamp.

    5.5 0.7 3.8 4.0 8.0 1.6 7.8 2.6 9.3 0.7 4.7 0.2 4.0
    3.1 0.3 1.0 4.6 3.4 0.6 7.9 3.7 10.8 1.8 1.2 4.0

Assuming an Exponential distribution with parameter $\theta$ is an appropriate model, find $\hat{\theta}_{\text{mle}}$ based on the above data. Hence find the maximum likelihood estimates of:
(a) the median of the distribution of the lifetimes of lamps in the population;
(b) the probability of a randomly chosen lamp surviving beyond 10 hours.
Compare these to appropriate simple estimates calculated directly from the data, without assuming an Exponential distribution.

*5. Let $X_1, \ldots, X_n$ be a random sample of size $n$ from a $N(\mu_X, \sigma^2)$ distribution and let $Y_1, \ldots, Y_m$ be a random sample of size $m$ from a $N(\mu_Y, \sigma^2)$ distribution, and assume the samples are independent of each other. Note that the means of the two distributions are assumed to be possibly different, but the variances are assumed to be the same.
(a) Since all $n + m$ random variables are independent, show that the loglikelihood function for $\mu_X$, $\mu_Y$ and $\sigma^2$ based on all $n + m$ observations is given by

$$\sum_{i=1}^{n} \log f_X(x_i; \mu_X, \sigma^2) + \sum_{j=1}^{m} \log f_Y(y_j; \mu_Y, \sigma^2).$$
(b) Hence, explain why the likelihood equations for this problem are

$$\frac{\partial}{\partial \mu_X} l(\mu_X, \mu_Y, \sigma^2) = \frac{\sum_{i=1}^{n} x_i - n\mu_X}{\sigma^2} = 0$$

$$\frac{\partial}{\partial \mu_Y} l(\mu_X, \mu_Y, \sigma^2) = \frac{\sum_{j=1}^{m} y_j - m\mu_Y}{\sigma^2} = 0$$

$$\frac{\partial}{\partial \sigma} l(\mu_X, \mu_Y, \sigma^2) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n} (x_i - \mu_X)^2}{\sigma^3} - \frac{m}{\sigma} + \frac{\sum_{j=1}^{m} (y_j - \mu_Y)^2}{\sigma^3} = 0$$

(c) Hence find expressions in terms of $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ for the joint maximum likelihood estimates of $\mu_X$, $\mu_Y$ and $\sigma^2$ from the combined sample.

6. Assume $n$ subjects are each given an envelope. Half the envelopes contain the instructions “Tick box 1 if you have ever cheated on your tax return and tick box 0 otherwise”; the others contain the instructions “Toss a coin and tick box 1 if it falls heads and box 0 if it falls tails”. The two sets of instructions are allocated to the envelopes at random and only the subject knows which set of instructions applied to him (or her). Assume that all subjects follow the instructions in their envelope honestly and correctly.
(a) Assume the probability that any given subject cheated on their tax return is $\tau$ and let $\theta$ denote the probability that a randomly chosen subject, following the procedure above, will tick box 1. Express $\tau$ in terms of $\theta$. What are the possible values for $\theta$?
(b) Assume 8 out of a sample of 20 subjects ticked box 1; find the maximum likelihood estimate of $\theta$ based on these data. Hence estimate $\tau$.

Problem Sheet 4: Assessing the performance of estimators

*1. Let $X_1, \ldots, X_n$ be a random sample from the Uniform(0, $\theta$) distribution, for which the population median is $\tau = \theta/2$.
(a) The method of moments estimate of $\tau$ is $\hat{\tau}_{\text{mom}} = \bar{X}$.
Find $E(\bar{X})$ and $\mathrm{Var}(\bar{X})$, and hence show that $\hat{\tau}_{\text{mom}}$ is unbiased and has mean square error $\theta^2/(12n)$.
(b) The maximum likelihood estimate of $\theta$ is $Y = \max\{X_1, \ldots, X_n\}$. You are given that $Y$ has probability density function $f_Y(y; \theta) = n y^{n-1}/\theta^n$ for $0 < y < \theta$ (and $f_Y(y; \theta) = 0$ otherwise). By calculating the relevant integral, show that $E(Y) = n\theta/(n + 1)$. The maximum likelihood estimator of the population median $\tau = \theta/2$ is $\hat{\tau}_{\text{mle}} = \max\{X_1, \ldots, X_n\}/2$. Use the results for $Y$ to show that $\hat{\tau}_{\text{mle}}$ has bias $-\theta/(2(n + 1))$.

*2. The methods and R commands required for this question are similar to those in the notes for the simulation from the Uniform(0,1) distribution, but with the obvious adjustments to the names and to the formulae for the various estimators.
(a) Use the R commands below to construct a matrix xsamples with 1000 rows and 10 columns. The 10 data values in each row can be thought of as a random sample of size 10 from an Exp($\theta$) distribution with rate $\theta = 1$ and the 1000 rows can be thought of as $B = 1000$ independent repeated samples.

    > xvalues <- rexp(10000,rate=1)
    > xsamples <- matrix(xvalues,nrow=1000)

(b) Calculate the maximum likelihood estimate $\hat{\theta} = 1/\bar{x}$ for each sample, and plot a histogram of the relative frequencies of the 1000 estimates of $\theta$ obtained.
(c) You may assume that the Exp($\theta$) distribution has median $\log(2)/\theta$. For each of the $B = 1000$ samples in part (a) above, calculate both the sample median and the maximum likelihood estimate of the population median $\tau(\theta) = \log(2)/\theta$. Produce a single plot containing a boxplot of the 1000 values of the sample median and a boxplot of the 1000 values of the mle of the population median.
(d) Since the observations above were drawn from a population distribution with $\theta = 1$, add a horizontal line at height $\log(2)$ to your plot showing the true value of the median for this population and use it to compare the sample median and the mle as estimators of the population median.
(e) Calculate the mean and the variance of the 1000 values of the sample median and the 1000 values of the maximum likelihood estimate of the population median and use these numerical values to compare the bias, variance and mean square error of the two estimators.

3. Let $T$ be the total number of heads obtained when a fair coin is tossed 10 times. Let $A = \{T \le 1\}$ be the event that at most one head is obtained and let $B = \{T \ge 6\}$ be the event that at least 6 heads are obtained.
(a) Calculate the exact values of $P(A)$ and $P(B)$.
(b) Calculate the approximate values of $P(A)$ and $P(B)$ given by applying the central limit theorem without and with a continuity correction.

*4. An architect is designing the car park for a new apartment block, which has 250 apartments. She believes that the residents in 10% of the apartments will require 2 parking places, that 60% will require 1 place, and that the remaining 30% will not have a car.
(a) Let $X$ be the number of parking places required by the residents of a randomly chosen apartment. Find the mean and variance of $X$.
(b) If the architect provides 210 car parking places for residents, what is the probability that this will not be enough? How many places would she need to provide for there to be a 99% chance that there will be enough places to satisfy all the residents’ demands?

5. Opinion polls indicate that support for the government has been about 37%, but it is thought that this may have changed in the light of recent events. Assume a random sample of $n$ electors is interviewed. Let $X_i = 1$ if the $i$th elector sampled supports the government and let $X_i = 0$ otherwise, so that $T_n = X_1 + \cdots + X_n$ is the total number in the sample that say they support the government. Assume throughout that $X_1, X_2, X_3, \ldots$ are independent random variables and that $P(X_i = 1) = 0.37 = 1 - P(X_i = 0)$. Assume the pollsters take a sample of size $n = 1500$.
Use the central limit theorem to find the probability that the proportion in the sample supporting the government will differ from 0.37 by no more than 0.02, i.e. find $P(|T_n/n - 0.37| \le 0.02)$. Perform this calculation with and without a continuity correction, to see if it makes a significant difference in this case.

Problem Sheet 5: Sampling distributions related to the Normal distribution

*1. Let $X_1, \ldots, X_n$ be a random sample of size $n$ from a general distribution with population mean denoted by $\mu = E(X)$ and population variance denoted by $\sigma^2 = \mathrm{Var}(X)$. (Note: we are not assuming here that the population has a Normal distribution.)
(a) Show that the sample mean $\bar{X} = (X_1 + \cdots + X_n)/n$ has expected value $\mu$ (and so $\bar{X}$ is always an unbiased estimator of the population mean).
(b) Show also that $\bar{X}$ has variance $\sigma^2/n$ (and so $\bar{X}$ always has mean square error equal to $\sigma^2/n$ as an estimator of $\mu$, where $\mu$ denotes the population mean and $\sigma^2$ denotes the population variance).
(c) Assume the fact (see the notes) that $\sum_{j=1}^{n} X_j^2 = \sum_{j=1}^{n} (X_j - \bar{X})^2 + n\bar{X}^2$. Show that, whatever the distribution of $X$, the sample variance $S^2 = \sum_{j=1}^{n} (X_j - \bar{X})^2/(n - 1)$ has expected value $\sigma^2$ (and so $S^2$ is always an unbiased estimator of the population variance $\sigma^2$).

2. Let $Y$ have a Gamma($\alpha$, $\lambda$) distribution. Show that $E(Y) = \alpha/\lambda$, and show that, for $\alpha > 1$, $E(1/Y) = \lambda/(\alpha - 1)$. [Hint: Recall from your Probability 1 notes that $\int_0^{\infty} x^{a-1} e^{-bx}\, dx = \Gamma(a)/b^a$ for all $a > 0$ and $b > 0$.]

*3. Let $X_1, \ldots, X_n$ be a random sample of size $n$ from the Exponential($\theta$) distribution. We found earlier that the maximum likelihood estimator of $\theta$ was $\hat{\theta}_{\text{mle}} = n/\sum_{i=1}^{n} X_i$.
(a) Find the moment generating function of $\sum_{i=1}^{n} X_i$. Hence show $\sum_{i=1}^{n} X_i$ has the Gamma($n$, $\theta$) distribution and state its mean.
(b) The population mean of the Exponential($\theta$) distribution is $\tau(\theta) = 1/\theta$ and the maximum likelihood estimator of this population mean is $\hat{\tau}_{\text{mle}} = 1/\hat{\theta}_{\text{mle}}$. Show that the maximum likelihood estimator $\hat{\tau}_{\text{mle}}$ has expected value $1/\theta$ (and so it is unbiased as an estimator of the population mean).
(c) Using the fact (see previous question) that for $Y$ with a Gamma($\alpha$, $\lambda$) distribution, $E(1/Y) = \lambda/(\alpha - 1)$, show that $E(\hat{\theta}_{\text{mle}}) = \theta n/(n - 1)$. Hence find the average error (i.e. the bias) of $\hat{\theta}_{\text{mle}}$ as an estimator of $\theta$ and show it is not an unbiased estimator of $\theta$.

4. Use the R commands help(TDist) and help(Chisquare) to find out how to compute the probability density function, distribution function and the inverse distribution function for the $t$ and the $\chi^2$ families of distributions.
(a) Plot the probability density function for the $t_r$ distribution over the interval $(-4, 4)$ for $r = 1, 5, 10$ and 15 degrees of freedom, and compare it with the probability density function for the $N(0, 1)$ distribution.
(b) Plot the probability density function for the $\chi^2_r$ distribution over the interval $(0, 20)$ for $r = 5, 10$ and 15 degrees of freedom.

*5. Let $X_1, \ldots, X_n$ be a random sample of size $n$ from the $N(\mu, \sigma^2)$ distribution. Denote the sample variance by $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2/(n - 1)$ and denote the maximum likelihood estimator of $\sigma^2$ by $\hat{\sigma}^2_{\text{mle}} = \sum_{i=1}^{n} (X_i - \bar{X})^2/n$.
(a) State the distribution of $\sum_{i=1}^{n} (X_i - \bar{X})^2/\sigma^2$ and its mean and variance, and thus find the mean and variance of both $S^2$ and $\hat{\sigma}^2_{\text{mle}}$.
(b) Let $\hat{\sigma}^2$ denote an estimator for $\sigma^2$. Remember that the bias of $\hat{\sigma}^2$ is defined as $E(\hat{\sigma}^2 - \sigma^2)$, while the mean square error is defined as $E[(\hat{\sigma}^2 - \sigma^2)^2]$. Note also that it can be easier to calculate the mean square error from its representation as $E[(\hat{\sigma}^2 - \sigma^2)^2] = \mathrm{Var}(\hat{\sigma}^2) + [\mathrm{bias}(\hat{\sigma}^2)]^2$.
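For reference, the representation of the mean square error quoted here follows by adding and subtracting $E(\hat{\sigma}^2)$ inside the square. The short derivation below is a standard argument, included as a reminder rather than as part of the question; it assumes only that $\hat{\sigma}^2$ has finite variance.

```latex
\begin{align*}
E\big[(\hat{\sigma}^2 - \sigma^2)^2\big]
  &= E\big[\big((\hat{\sigma}^2 - E\hat{\sigma}^2) + (E\hat{\sigma}^2 - \sigma^2)\big)^2\big] \\
  &= E\big[(\hat{\sigma}^2 - E\hat{\sigma}^2)^2\big]
     + 2\,(E\hat{\sigma}^2 - \sigma^2)\,E\big[\hat{\sigma}^2 - E\hat{\sigma}^2\big]
     + (E\hat{\sigma}^2 - \sigma^2)^2 \\
  &= \mathrm{Var}(\hat{\sigma}^2) + \big[\mathrm{bias}(\hat{\sigma}^2)\big]^2 ,
\end{align*}
% The cross term vanishes because E[\hat{\sigma}^2 - E\hat{\sigma}^2] = 0,
% and E\hat{\sigma}^2 - \sigma^2 is a constant equal to the bias.
```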
Use the results of part (a) to compare the performance of $S^2$ and $\hat{\sigma}^2_{\text{mle}}$ as estimators of $\sigma^2$ in terms of their bias and mean square error.

6. Suppose that $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$ are all i.i.d. Normal(0, $\theta^2$) where $\theta$ is unknown.
(a) Write down the distribution of the random variable $T_i = X_i^2 + Y_i^2$ for each $i$. Hence find the maximum likelihood estimate of $\theta$ based on observations $t_1, t_2, \ldots, t_n$ of $T_1, T_2, \ldots, T_n$.
(b) Find also the maximum likelihood estimate of $\theta$ based on observations $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$ of $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$, respectively.

Problem Sheet 6: Confidence Intervals

To complete this sheet, you will need values of distributions such as $t$ and $\chi^2$. This requires access to R (or the use of statistical tables). Use of tables is not taught in the course, and in the exam, tables are not needed and will not be provided. All of the standard distribution probability and quantile values needed can be found in the R output below.

    > z
    [1] 0.84162 0.95000 0.99000 1.28155 1.64485 1.95996 2.57583 2.84400
    > pnorm(z)
    [1] 0.80000 0.82894 0.83891 0.90000 0.95000 0.97500 0.99500 0.99777
    > qt( c(.9,.95,.975), 8)
    [1] 1.396815 1.859548 2.306004
    > qt( c(.9,.95,.975), 32)
    [1] 1.308573 1.693889 2.036933
    > qt( c(.9,.95,.975), 33)
    [1] 1.307737 1.692360 2.034515
    > qchisq(c(.025,.05,.1,.9,.95,.975),7)
    [1] 1.6899 2.1673 2.8331 12.0170 14.0671 16.0128
    > qchisq(c(.025,.05,.1,.9,.95,.975),8)
    [1] 2.1797 2.7326 3.4895 13.3616 15.5073 17.5345
    > qchisq(c(.025,.05,.1,.9,.95,.975),9)
    [1] 2.7004 3.3251 4.1682 14.6837 16.9190 19.0228
    > qchisq(c(.025,.05,.1,.9,.95,.975),32)
    [1] 18.291 20.072 22.271 42.585 46.194 49.480
    > qchisq(c(.025,.05,.1,.9,.95,.975),33)
    [1] 19.047 20.867 23.110 43.745 47.400 50.725
    > qchisq(c(.025,.05,.1,.9,.95,.975),34)
    [1] 19.806 21.664 23.952 44.903 48.602 51.966
    > qchisq(c(.025,.05,.1,.9,.95,.975),50)
    [1] 32.357 34.764 37.689 63.167 67.505 71.420
    > qchisq(c(.025,.05,.1,.9,.95,.975),51)
    [1] 33.162 35.600 38.560 64.295 68.669 72.616

Similar information may be provided in the exam, and it is important to know how to interpret it and extract the required information.

*1. In an experiment to determine the fuel consumption of a new model of car, a driver was employed to drive nine new cars, each for 100km. The fuel consumption in litres for each of the nine 100km drives is displayed in the table below. The data is contained in the Statistics 1 data set fuel.

    12.09 11.18 9.97 10.50 9.92 9.97 11.84 10.93 10.70

(a) State clearly any assumptions you make. Explore the data (e.g. with a stem and leaf plot) to confirm that your assumptions are appropriate.
(b) Find a 90% confidence interval for the mean fuel consumption per 100km for cars of this type.
(c) Find a 90% confidence interval for the variance of fuel consumption per 100km for the population of cars of this type.

2.
Neurobiological arguments suggest that learning to play an instrument may improve the spatial-temporal reasoning of pre-school children. A study measured the spatial-temporal reasoning of 34 preschool children before and after six months of piano lessons. The changes in the reasoning scores of the children are displayed in the table below. You may assume the data represents the observed values of a simple random sample of size $n = 34$ from a population with unknown mean $\mu$ and unknown variance $\sigma^2$. The data is contained in the Statistics 1 data set piano.

    2 2 5 9 7 −2 2 7 4 1 0 7 3 4 3 4 9
    4 5 6 0 3 6 −1 3 4 6 7 −1 7 −3 3 4 4

Construct a 95% confidence interval for the population mean improvement in reasoning scores, stating clearly any assumptions you make. You should summarize and display the distribution of the data, say in a histogram, and hence check that your assumptions are appropriate. Under the same assumptions, construct a 95% confidence interval for the variance of the improvement in reasoning scores in the population.

Note: Given a single data set xdata, containing $n$ values with sample mean $\bar{x}$ and sample variance $s^2$, the R command t.test(xdata,conf.level=0.95) will produce output that includes a 95% confidence interval based on the end points

$$c_L = \bar{x} - t_{n-1;\alpha/2}\, s/\sqrt{n} \quad \text{and} \quad c_U = \bar{x} + t_{n-1;\alpha/2}\, s/\sqrt{n}$$

with $\alpha = 0.05$. The confidence level can, of course, be changed as desired. You should answer the questions above by going through the relevant working yourself, but you may wish to use this command to check your answers in cases where it is appropriate.

*3. Consider again the following failure-time data for the batch of 25 lamps, which you may assume is a simple random sample from an Exponential distribution with unknown parameter $\theta$. The data is contained in the Statistics 1 data set lamp.

    5.5 0.7 3.8 4.0 8.0 1.6 7.8 2.6 9.3 0.7 4.7 0.2 4.0
    3.1 0.3 1.0 4.6 3.4 0.6 3.7 7.9 10.8 1.8 1.2 4.0

(a) Find an equal-tailed 95% confidence interval for the unknown parameter $\theta$ based on this set of 25 observations.
(b) Let $X_1, \ldots, X_n$ be a simple random sample of size $n$ from the Exp($\theta$) distribution. You may assume that $E(1/\sum_{i=1}^{n} X_i) = \theta/(n - 1)$. Use this result to find the average length of a 95% confidence interval for $\theta$ based on a random sample of size $n = 25$, expressing your answer as a multiple of the unknown parameter $\theta$.

4. Assume that a random sample of 1000 electors are interviewed and that 370 of those interviewed say that they support the government. Find a 99% confidence interval for the proportion of electors that support the government.

*5. Assume the 25 observations below are a random sample from the Unif(0, $\theta$) distribution.

    1.41 2.04 0.11 4.44 0.61 5.48 4.06 1.53 2.81 0.10 4.23 4.82 2.68
    5.99 4.43 2.35 2.98 0.07 4.15 3.24 0.10 5.83 4.04 1.57 5.57

For the Unif(0, $\theta$) distribution we saw earlier that the method of moments estimate $\hat{\theta}_{\text{mom}}$ and the maximum likelihood estimate $\hat{\theta}_{\text{mle}}$ were given by $\hat{\theta}_{\text{mom}} = 2\bar{X}$, where $\bar{X}$ is the sample mean, and $\hat{\theta}_{\text{mle}} = X_{(n)}$, where $X_{(n)} = \max(X_1, \ldots, X_n)$ is the sample maximum.
(a) Use the fact that for a random sample of size $n$ from the Unif(0, $\theta$) distribution, $P(X_{(n)}/\theta \le v) = v^n$ for $0 < v < 1$, to find values $v_1$ and $v_2$ such that $P(X_{(25)}/\theta < v_1) = 0.025$ and $P(X_{(25)}/\theta > v_2) = 0.025$. Hence, following the general idea seen in the construction of other confidence intervals, but with different details, find an equal-tailed 95% confidence interval for $\theta$ based on $\hat{\theta}_{\text{mle}}$.
(b) Find an equal-tailed 95% confidence interval for $\theta$ based on $\hat{\theta}_{\text{mom}}$. [Hint: Use the Normal approximation to the distribution of $\bar{X}$ based on the Central Limit Theorem, and replace the factor of $\sigma^2$ in the variance by $\hat{\theta}^2_{\text{mom}}$. You should obtain a confidence interval as a multiple of $\hat{\theta}_{\text{mom}}$.]
(c) Comment on which of these two approaches gives a smaller confidence interval, and whether the intervals you get are compatible with the data.

Problem Sheet 7: Hypothesis tests

*1. A tyre company claims its tyres have a mean useful lifetime of 42,000 miles. A consumer association bought one of the company’s tyres from each of 10 randomly chosen outlets and tested them on a test rig that simulated normal road conditions. The observed lifetimes (in thousands of miles) were

    42  36  46  43  41  35  43  45  40  39.

Thinking carefully about the context, what is an appropriate alternative hypothesis $H_1$ to use in testing the manufacturer’s claim? Use the data to test whether or not there is sufficient evidence to reject the manufacturer’s claim, using a test procedure with significance level α = 0.05. The data are contained in the Statistics 1 data set tyre.lifetimes.

[Your answer should include a statement of any model assumptions, a brief description of your working at each stage of the test procedure including the p-value and the critical region for the test, and a summary of your conclusions. You may find it helpful to check your numerical calculations using the t.test() function in R. If you do this question by hand calculation, it may help to know that pt(0.8808,9) gives 0.7993325 and qt(0.95,9) gives 1.833113.]

*2. A certain manufacturer produces packets of biscuits with a nominal weight of 200g. You may assume that it is known from experience that the standard deviation of the weight of the packets is 4g.
To carry out a control check on the actual weight of the packets produced, an employee weighs 25 packets selected at random from a day’s production and finds that the average weight of the sample is $\bar{x} = 202.275$g. Let µ denote the actual mean weight of 200g packets produced by the manufacturer. Test the null hypothesis $H_0: \mu = 200$ against the alternative $H_1: \mu \neq 200$, using a test procedure with significance level α = 0.01. For what range of significance levels would you reject $H_0$ in favour of $H_1$?

*3. A random variable X is known to have a Normal distribution with mean µ and variance 25. To test the hypotheses $H_0: \mu = 100$ versus $H_1: \mu > 100$, a test procedure is proposed which would take a simple random sample of size n from the population distribution of X and reject $H_0$ in favour of $H_1$ if the sample mean $\bar{x} > 102$, and otherwise accept $H_0$. Find an expression in terms of the sample size n for the significance level α of this test procedure. Hence find the smallest sample size $n^*$ for which the significance level would be less than 0.05. If the alternative were $H_1: \mu = 103$, calculate the power of this test for a sample of size $n^*$ and for a sample of size $10n^*$.

Problem Sheet 8: Comparison of population means

Note: for each of the following questions your answer should include a statement of any model assumptions, a brief description of your working at each stage of the test procedure including the approximate p-value and the critical region for the test, and a summary of your conclusions. You may find it helpful to check your numerical calculations using the t.test() function in R.

*1.
You have learned about several different hypothesis tests that are relevant to different situations. It is vitally important that you apply the correct test in each situation. The following are brief descriptions of experiments that are designed to test a hypothesis. For each description write down (i) the null and alternative hypotheses, and (ii) the type of hypothesis test that should be used. There is no need to write down the test statistic or perform the test.

(a) A car manufacturer claims that the level of nitrous oxide emissions from its new engine is lower than from its old engine. A researcher evaluates eight engines of the old type and nine of the new type. As a preliminary step the researcher establishes that there is no difference between the variances of nitrous oxide emissions in the two engine types.

(b) A researcher measures the systolic blood pressure (SBP) of 20 men and 20 women in a clinic. She wants to know whether there is a difference between SBP in men and women.

(c) Students willingly volunteer to test whether alcohol increases reaction times. A random sample of 24 students have their reaction times measured and are then given an alcoholic drink. Their reaction times are measured again, half an hour later.

(d) A manufacturer makes chocolate bars that are advertised as having a weight of 50g. To carry out a production control check, employees select 30 bars at random from a day’s production. They want to make sure that the company is not manufacturing underweight bars.

2. In a study of blood glucose level, measurements were made on a sample of 52 pregnant women in their third trimester of pregnancy. Their values (in milligrams/100 millilitres of blood) were found to have a sample mean of 70.12. Healthy women who are not pregnant are known to have a mean value of 80 (mg/100ml), with a standard deviation of 10 (mg/100ml), and we will assume that the standard deviation for pregnant women is also 10 (mg/100ml).
(a) Use the data to test the hypothesis that pregnant women have a lower glucose level than women who are not pregnant. [Your answer should include a statement of any model assumptions, a brief description of your working at each stage of the test procedure including the p-value, and a summary of your conclusions. You will probably need to use R to calculate the actual p-value.]

(b) Now assume the test was carried out using a procedure with significance level α = 0.01. Calculate the probability of type II error under the alternative hypothesis that the true mean value of glucose level in pregnant women is 79 (mg/100 ml). Hence calculate the power of the test to discriminate between the null hypothesis and this alternative.

3. (This question reminds you that sometimes the null hypothesis is true but due to chance the test statistic is such that the null hypothesis is rejected. In particular, under a test procedure with significance level α, about 100α% of samples will result in the null hypothesis being rejected even when $H_0$ is true.) In this question we investigate the one-sided one sample t-test of the null hypothesis $H_0: \mu = \mu_0$ against the alternative $H_1: \mu > \mu_0$, using a test procedure with significance level α = 0.05.

(a) Choose valid numerical values of $\mu_0$ and $\sigma^2$ for normally distributed data. Use the R commands below to construct a matrix xsamples with 1000 rows and 10 columns. The 10 data values in each row can be thought of as a random sample of size 10 from a $N(\mu_0, \sigma^2)$ distribution. You will need to substitute in your own chosen values of $\mu_0$ and σ, or store them as separate objects in R.

    > xvalues <- rnorm(10000, mean=mu0, sd=sigma)
    > xsamples <- matrix(xvalues,nrow=1000)

Now calculate the sample mean and sample standard deviation for each sample, and store them as sample.mean and sample.sd respectively.
(b) Calculate the observed test statistic for each sample, using the command:

    > sample.tobs <- sqrt(10) * (sample.mean - mu0) / sample.sd

again substituting in your chosen value of $\mu_0$. Plot a histogram of the relative frequencies of the 1000 observed test statistics that you obtain. What sort of distribution does the histogram portray? What distribution should we expect to see?

(c) For a one-sided test the critical region will be of the form $C = \{T > c^*\}$. Calculate the exact value of $c^*$ for n = 10 and mark this point with a vertical line on your histogram. What do you notice?

(d) On average, how many of the 1000 sample test statistics should be inside the critical region? How many of the tests in your sample were significant?

*4. Eight athletes ran a 400 metre race at sea level and at a later meeting ran a 400 metre race at high altitude. Their times in seconds are shown in the table below. Test the null hypothesis that race times are unaffected by altitude, against the alternative that race times are greater at high altitude, using a test procedure with significance level α = 0.05. The data are contained in the Statistics 1 data set runner.

    Runner            1     2     3     4     5     6     7     8
    Sea Level      48.3  47.6  49.2  50.3  48.8  51.1  49.0  48.1
    High Altitude  50.4  47.3  50.8  52.3  47.7  54.5  48.9  49.9

[These values may be useful: pt(2.1958,7) is 0.9679363 and qt(0.95,7) is 1.894579.]

*5. To investigate the relative size of secretarial starting salaries in the public and private sector, 9 private sector posts and 10 public sector posts were chosen at random from jobs advertised on the web. The table below shows the advertised starting salaries (in £1,000; this data set is quite old!). You may assume that the population variances are the same in the private and public sectors. Use the data to test at the 0.05-level the hypothesis that starting salaries are the same in the two sectors against the alternative that private sector starting salaries are higher.
The data are contained in the Statistics 1 data set secretaries.

    Private sector   Public sector
    12.1 9.3 13.4 8.5 11.3 8.2 10.6 13.1 9.7 8.8 12.5 11.9 9.6 10.1 13.6 9.8 11.2 12.2 10.4

6. A study was conducted of 10 households to see if alerting them to high usage rates of electricity reduced their actual consumption. A small monitor was installed in each household, which activated a red flashing light whenever the current rate of usage exceeded a preset threshold. The monthly usage (in kilowatt-hours) before and after installation of the monitor is given below. Test at the 0.05-level whether the monitor is effective at reducing electrical consumption. The data are contained in the Statistics 1 data set kwh.

    Household     1     2     3     4     5     6     7     8     9    10
    Before      940  1370  1030  2030  1540  2300  1800   910   640  1200
    After       900  1230  1060  2100  1250  2200  1820   900   630  1110

7. In a study to examine whether increasing the amount of calcium in the diet reduced blood pressure, a group of 10 men were given a calcium supplement in their diet for 12 weeks, and a control group of 11 men received a placebo (a pill that appeared identical, but contained no active substance). The table below shows the relative change (in mm of mercury) in blood pressure over the 12 week period (before − after) for each subject. What do you conclude from the results shown? The data are contained in the Statistics 1 data set bp.

    Calcium group   −7    4  −18  −17    3    5   −1  −10  −11    2
    Placebo group    1  −12    1    3   −3    5   −5   −2   11    1    3

Problem Sheet 9: Linear regression

*1.
As part of a study of the relationship between ‘stress’ and ‘skill’, the stress levels for eight second year student volunteers were assessed and compared with their subject skill levels, as measured by each student’s average mark at the end of their first year. Summary statistics for the data set are:

    $n = 8$,  $\sum x_i = 492$,  $\sum y_i = 379$,  $\sum x_i^2 = 32{,}894$,  $\sum y_i^2 = 20{,}115$,  $\sum y_i x_i = 21{,}087$,

where $x_i$ is the subject skill level for the ith student and $y_i$ is their assessed stress level. Calculate by hand the least squares estimates of α and β and the equation for the fitted regression line, under the simple linear regression model $E(Y \mid x) = \alpha + \beta x$. What broad conclusion can you draw immediately from the fitted model? What stress level would you predict for a student with skill level x = 60?

2. The table below shows a data set with five pairs of values $(x_i, y_i)$, i = 1, ..., 5. It is thought that the data satisfy the simple linear regression model $E(Y \mid x) = \alpha + \beta x$, $\mathrm{Var}(Y \mid x) = \sigma^2$.

    x   1   3   4   6   7
    y   0   1   2   5   4

(a) Calculate by hand the least squares estimates of α and β.

(b) For i = 1, ..., 5, calculate by hand the fitted values $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and the residual values $\hat{e}_i = y_i - \hat{y}_i$. Hence calculate by hand an estimate of the common variance $\sigma^2$. Also, find the sum of the residuals.

3. The table below shows the average weight (in kg) of piglets in a litter, for seven litters of varying size. The data are contained in the Statistics 1 data frame pig, variables littersize and wt respectively.
    Litter size (x)       1    3    5    8    8    9    10
    Average weight (y)  1.6  1.5  1.5  1.3  1.4  1.2  1.1

Perform a simple linear regression of average weight on litter size and output the results to the R object piglets, with the commands:

    > attach(pig); piglets <- lm(wt ~ littersize)

Calculate by hand the least squares estimates $\hat{\alpha}$ and $\hat{\beta}$ for the simple linear regression model $E(Y \mid x) = \alpha + \beta x$ and confirm your answers with the R command:

    > coef(piglets)

Draw a scatter plot of the data and add in the fitted regression line, using the commands:

    > plot(littersize,wt); abline(coef(piglets))

Use your fitted regression line to predict the average weight of a piglet in a litter of size 6.

Let $x_i$ denote the litter size for the ith litter and let $y_i$ denote the corresponding average weight for the piglets in that litter. Inspect the fitted values $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and the residual values $\hat{e}_i = y_i - \hat{y}_i$ with the commands:

    > fitted(piglets); residuals(piglets)

Plot the residuals against the litter sizes using the command:

    > plot(littersize, residuals(piglets))

and comment on the fit of the model.

*4. The table below shows the rainfall (in inches) for the spring (April/May) and the following autumn (September/October) for each of ten consecutive years. The data are contained in the Statistics 1 data frame rain, in variables spring and autumn respectively. (Since there is a dataset called spring, you cannot attach the dataset rain, but you can extract columns using rain$spring etc.)

    Spring rainfall (x)   1.6  5.3   2.8  9.6  6.7  1.5   5.4  8.5  4.1  4.6
    Autumn rainfall (y)   6.0  2.9  11.1  8.2  1.3  9.1  10.2  5.2  3.9  8.3

(a) Let $x_i$ denote the spring rainfall in the ith year and let $y_i$ denote the corresponding autumn rainfall. Use R to calculate the least squares estimates $\hat{\alpha}$ and $\hat{\beta}$ for the simple linear regression model $E(Y \mid x) = \alpha + \beta x$. Use R to produce a scatter plot of the data and use R to add in the fitted regression line.

(b) For i = 1, ..., 10, use R to calculate the fitted values $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and the residual values $\hat{e}_i = y_i - \hat{y}_i$. Use R to plot the residuals against the spring rainfalls, and comment on the fit of the model.

*5. Consider a regression problem where the data values $y_1, \ldots, y_n$ are observed values of response variables $Y_1, \ldots, Y_n$. In the notes we assume that, for given values $x_1, \ldots, x_n$ of the predictor variable, the $Y_i$ satisfy the simple linear regression model $Y_i = \alpha + \beta x_i + e_i$, where the $e_i$ are i.i.d. $N(0, \sigma^2)$. The least squares estimates of the regression parameter(s) are defined to be the values which minimise the sum of squares of the differences between the observed $y_i$ and the fitted values. For this model, $E(Y_i \mid x_i) = \alpha + \beta x_i$, so the least squares estimates are the values minimising $\sum_{i=1}^{n} (y_i - (\alpha + \beta x_i))^2$.

Now consider an alternative model which takes the form $Y_i = \gamma x_i + e_i$, i = 1, ..., n, where the $e_i$ satisfy the same assumptions as before but where there is now a single unknown regression parameter γ. This model is sometimes used when it is clear from the problem description that the value of E(Y) must be zero if the corresponding x value is zero.

Derive an expression, in terms of the $x_i$ and $y_i$ values, for the least squares estimate for γ for this new model and suggest, with reasons, an appropriate estimate for $\sigma^2$.

Problem Sheet 10: Linear Regression: Confidence Intervals & Hypothesis Tests

1. The table below shows a data set with five pairs of values, $(x_i, y_i)$, i = 1, ..., 5. Assume that the data satisfy the simple Normal linear regression model $Y_i = \alpha + \beta x_i + e_i$, where the $e_i$ are i.i.d. $N(0, \sigma^2)$.
    x   1   3   4   6   7
    y   0   1   2   5   4

(a) Test the null hypothesis $H_0: \alpha = 0$ against the alternative $H_1: \alpha \neq 0$ using a test procedure with significance level 0.05.

(b) Test the null hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$ using a test procedure with significance level 0.05.

In each case calculate the p-value of the observed test statistic.

*2. A study was conducted to examine the dependence of metabolic rate on body mass for 7 dogs, yielding data given in the table below.

    Body mass (kg)   Metabolic rate (kcal/day)
         31.20              1113.2
         24.00               981.8
         19.80               908.2
         18.20               840.8
          9.61               626.2
          6.50               429.5
          3.19               280.9

It was decided to analyse these data on a log scale; defining x as log body mass and y as log metabolic rate, we fit a Normal linear regression model $Y_i = \alpha + \beta x_i + e_i$. Summary statistics calculated from the data are

    $\sum x_i = 17.800$,  $\sum y_i = 45.590$,  $\sum x_i^2 = 49.239$,  $\sum y_i^2 = 298.433$,  $\sum y_i x_i = 118.365$.

Calculate the end points of 99% confidence intervals for α and β. [In R, I found qt(c(0.95,0.975,0.99,0.995),5) gives [1] 2.015048 2.570582 3.364930 4.032143; some of these values may be useful.]

3. The table below shows the rainfall (in inches) for the spring and the following autumn for each of ten consecutive years. Let $x_i$ denote the observed spring rainfall in the ith year and $y_i$ denote the corresponding observed autumn rainfall. Assume that the data satisfy the simple Normal linear regression model $Y_i = \alpha + \beta x_i + e_i$, where the $e_i$ are i.i.d. $N(0, \sigma^2)$. The data are contained in the Statistics 1 data set rain.

    Spring rainfall (x)   1.6  5.3   2.8  9.6  6.7  1.5   5.4  8.5  4.1  4.6
    Autumn rainfall (y)   6.0  2.9  11.1  8.2  1.3  9.1  10.2  5.2  3.9  8.3

(a) Find a 90% confidence interval for β.

(b) Test the null hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$, using a test procedure with significance level equal to 0.10.
(c) Show that in general a test of the null hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$ with significance level (say) γ will accept the null hypothesis if and only if the corresponding 100(1 − γ)% confidence interval for β contains the value β = 0.

*4. The table below shows the average weight (in kg) of piglets in a litter, for seven litters of varying size. The data are contained in the Statistics 1 data set pig.

    Litter size (x)       1    3    5    8    8    9    10
    Average weight (y)  1.6  1.5  1.5  1.3  1.4  1.2  1.1

Summary statistics are:

    $\sum x_i = 44$,  $\sum y_i = 9.6$,  $\sum x_i^2 = 344$,  $\sum y_i^2 = 13.36$,  $\sum y_i x_i = 57$.

When a simple linear regression of average weight on litter size is performed in R using the command piglets <- lm(wt ~ littersize,data=pig), the output of the command summary(piglets) includes the following lines:

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  1.683051   0.064520  26.086 1.55e-06 ***
    littersize  -0.049576   0.009204  -5.387  0.00297 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.07558 on 5 degrees of freedom

(a) Show how each of the numerical values on the three lines beginning (Intercept), littersize and Residual standard error were calculated, and explain the interpretation of the asterisks *** and ** at the end of the lines.

(b) What conclusion would you reach from the output for testing the null hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$?
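
As a way of checking hand calculations for Question 4, the following is a minimal sketch in R of how the quantities in a coefficient table of this kind can be rebuilt from sums of squares and products. It assumes the standard formulas for the simple Normal linear regression model. The data are typed in from the table above rather than loaded from the course data file, and the object names Sxx, Syy, Sxy, s and by.hand are illustrative choices that do not appear in the problem sheet.

```r
# Data typed in from the table in Question 4 (assumption: not loaded from stats1.RData).
littersize <- c(1, 3, 5, 8, 8, 9, 10)
wt <- c(1.6, 1.5, 1.5, 1.3, 1.4, 1.2, 1.1)
n <- length(littersize)

# Corrected sums of squares and products, built from the raw sums.
Sxx <- sum(littersize^2) - sum(littersize)^2 / n
Syy <- sum(wt^2) - sum(wt)^2 / n
Sxy <- sum(littersize * wt) - sum(littersize) * sum(wt) / n

# Least squares estimates of the slope and intercept.
beta.hat <- Sxy / Sxx
alpha.hat <- mean(wt) - beta.hat * mean(littersize)

# Residual standard error on n - 2 degrees of freedom.
s <- sqrt((Syy - Sxy^2 / Sxx) / (n - 2))

# Standard errors, t values and two-sided p-values.
se.beta <- s / sqrt(Sxx)
se.alpha <- s * sqrt(1 / n + mean(littersize)^2 / Sxx)
t.alpha <- alpha.hat / se.alpha
t.beta <- beta.hat / se.beta
p.alpha <- 2 * pt(-abs(t.alpha), df = n - 2)
p.beta <- 2 * pt(-abs(t.beta), df = n - 2)

# Rows: intercept, slope. Columns: estimate, std. error, t value, p-value.
by.hand <- rbind(c(alpha.hat, se.alpha, t.alpha, p.alpha),
                 c(beta.hat, se.beta, t.beta, p.beta))

# Compare with the coefficient table that summary() reports for the fitted model.
piglets <- lm(wt ~ littersize)
print(by.hand)
print(coef(summary(piglets)))
print(s)
```

The two printed tables should agree up to rounding. The sketch only checks arithmetic; the written answer still needs the reasoning behind each formula and the interpretation of the significance codes.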