Estimation: Practical exercises using R

WTCHG course in statistical modelling and data analysis
Self-assessment exercises
Gil McVean

The aim of this series of questions is to revisit some of the key ideas presented in the first two weeks, to make sure you have a thorough grasp of the key concepts in statistical analysis. Each question has three parts: a simple question to answer, a short practical exercise to implement, and a more advanced question for those who want a challenge.

1. Plotting data

A) What aspects of your data might you aim to summarise graphically by the use of i) a histogram, ii) a boxplot and iii) a QQplot?

B) Using R, read in the file “gc_content.txt”, which contains measurements of GC content in 170 contiguous windows of 1000bp along a part of chromosome 1 in humans. Plot the data in the second column using a histogram and a boxplot. Summarise what these figures tell you about the distribution of the data. Using a QQplot, compare the distribution of these data to the standard normal distribution.

C) How might you try to reject the null hypothesis that these data are drawn from a normal distribution? Devise a statistic that is sensitive to departures from normality and use a simulation procedure to assess the evidence for a departure from normality in the GC content data.

2. Distributions

A) Explain the relationship between the exponential distribution, the Poisson distribution and the gamma distribution.

B) Using R, read in the data set “coverage_data.txt”. These data measure the depth of shotgun sequence data across the same 170kb region of human chromosome 1 as in the previous question. Plot the data in the second column using a histogram. We want to try to model the data by an exponential distribution. How can we use the method of moments to estimate the parameter of the distribution? Obtain an estimate and use a QQplot to investigate whether it is a good description.

C) Fit a gamma distribution to the same data using the method of moments.
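The method-of-moments fits asked for in 2B and 2C can be sketched as follows. This is only an illustration on simulated values, since the course file is not reproduced here: the vector `x`, the seed and the simulation parameters are assumptions, and with the real data `x` would be the second column of “coverage_data.txt”.

```r
# Simulated stand-in for the coverage values (assumption: positive,
# right-skewed data); replace with the second column of the course file.
set.seed(1)
x <- rgamma(170, shape = 2, rate = 0.1)

# Exponential: E[X] = 1/rate, so matching the first moment gives
# rate = 1/mean.
rate_exp <- 1 / mean(x)

# Gamma: E[X] = shape/rate and Var[X] = shape/rate^2, so matching the
# first two moments gives shape = mean^2/var and rate = mean/var.
shape_gam <- mean(x)^2 / var(x)
rate_gam  <- mean(x) / var(x)

# QQ plot of the observed values against the fitted exponential.
p <- ppoints(length(x))
plot(qexp(p, rate = rate_exp), sort(x),
     xlab = "Fitted exponential quantiles", ylab = "Observed quantiles")
abline(0, 1)  # points near this line indicate a good fit
```

Points that bend away from the line in the QQ plot suggest that the one-parameter exponential is too restrictive for the data.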
Remember that the exponential is a special case of the gamma distribution. How different is the distribution that you have fitted to the exponential?

3. The Central Limit Theorem

A) State the central limit theorem and explain why it is so important in statistics.

Gil McVean Last modified 01/11/2008

B) Using R, simulate 1000 bootstrap samples from the coverage data and obtain the mean of each sample. What values do you expect for the mean and the variance of this distribution? Compare the distribution of these means to a normal distribution with the expected mean and variance. Now repeat, but calculating the variance of each sample. What distribution do you expect the sample variances to take? Why?

C) The Pareto distribution is used to describe phenomena in which a small fraction of objects account for much of the activity/mass (grains of sand, traffic on the internet, sizes of human settlements, etc.), i.e. long-tailed distributions. It is characterised by two parameters: a shape parameter k (k > 0) and a scale parameter x_m (x_m > 0). The pdf is given by

$$f(x \mid k, x_m) = \frac{k\, x_m^{k}}{x^{k+1}} \quad \text{for } x \ge x_m$$

Using your powers of integration, work out whether the distribution has finite mean and variance and, consequently, whether the CLT holds for this distribution.

4. Estimating uncertainty

A) Explain what is meant by a confidence interval.

B) Calculate a 95% confidence interval for the mean of the GC content data. [Note: it would be good to go through the details of this calculation before checking that you agree with the results of applying t.test().]

C) It turns out that the first 100 and the second 70 observations in the coverage data come from different experiments. I want to test the hypothesis that the mean coverage differs between the two experiments. Using the two-sample t-test, obtain 95% confidence intervals for the difference in coverage between the two experiments.
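The hand calculation suggested in 4B, and the two-sample call for 4C, might look like the sketch below. It runs on simulated values rather than the course files: `gc_frac`, `cov_sim`, the seed and all simulation parameters are assumptions made purely for illustration.

```r
set.seed(2)

# Hypothetical stand-in for the 170 GC content measurements.
gc_frac <- rnorm(170, mean = 0.42, sd = 0.05)

# One-sample 95% interval by hand: mean +/- t quantile * standard error.
n       <- length(gc_frac)
se      <- sd(gc_frac) / sqrt(n)
t_crit  <- qt(0.975, df = n - 1)
ci_hand <- mean(gc_frac) + c(-1, 1) * t_crit * se

# The same interval from t.test(), for comparison.
ci_r <- as.numeric(t.test(gc_frac)$conf.int)
print(rbind(by_hand = ci_hand, t_test = ci_r))

# Hypothetical stand-in for the coverage data: 100 values from one
# experiment followed by 70 from another.
cov_sim <- c(rexp(100, rate = 1 / 20), rexp(70, rate = 1 / 25))

# Two-sample t-test; by default t.test() does not assume equal variances
# (Welch's test). Use var.equal = TRUE for the pooled-variance version.
two <- t.test(cov_sim[1:100], cov_sim[101:170])
print(two$conf.int)
print(two$p.value)
```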
What is the probability of observing as big a difference under the null? What assumptions have you had to make in order to obtain these inferences?

5. Likelihood

A) What is the likelihood function? Explain the meaning of the terms i) maximum likelihood estimate, ii) likelihood ratio test and iii) support interval.

B) For the data on shotgun coverage, calculate the (log) likelihood function for the parameter of the exponential distribution over a grid of (say 100) points between the minimum and maximum of the observed values. Find the value of the parameter that maximises the likelihood. Find the 2-unit support interval and compare this to a 95% confidence interval calculated using the approach in 4B.

C) For the same data set, try to fit a gamma distribution using maximum likelihood. You cannot do this analytically, so one option is to construct a grid for the two parameters and find an approximation to the MLEs. Alternatively, you can use a numerical optimisation algorithm, such as the inbuilt function optim(). For example, if you have put the coverage data into a table called coverage with the second column containing the values, the following code will find MLEs for the parameters of the distribution:

    # Negative log-likelihood of a gamma sample with shape par[1] and rate par[2]
    fn <- function(par, data) {
      n   <- length(data)
      xb  <- mean(data)        # sample mean
      lxb <- mean(log(data))   # mean of the logged values
      -n * (par[1] * log(par[2]) - lgamma(par[1]) - par[2] * xb + (par[1] - 1) * lxb)
    }

    data <- coverage[, 2]
    # Method-of-moments estimates, used as starting values
    a1 <- mean(data)^2 / var(data)
    b1 <- mean(data) / var(data)
    optim(c(a1, b1), fn, gr = NULL, data = data)

Calculate the maximum log-likelihood under the gamma fit and also under the exponential fit. Is it appropriate to use the standard theory of likelihood ratio tests to ask whether the gamma is a better fit than the exponential? If so, what is the probability of observing an increase in log-likelihood as great as that you observed if the null were true?

6. Linear modelling

A) Suppose I have a response variable Y and two explanatory variables, X and Z. Which of the following is a valid linear model? (Terms with $\beta$ are the parameters to be estimated.)

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i$$
$$y_i = \beta_1 \exp(-x_i) + \beta_2 / z_i$$
$$y_i = \beta_0 + \beta_1 x_i^2 + \beta_2 x_i z_i$$
$$y_i = \beta_0 + \beta_1 x_i z_i$$
$$y_i = \beta_0 + \beta_1 x_i^{\beta_2}$$

B) We would like to find out which genomic features influence coverage in the shotgun experiment. You have already got data on GC content, but another important feature might be repeat content (the presence of short or long elements that are found in many copies throughout the genome, such as transposons). In the file “repeat_content.txt” you will find the fraction of each 1kb region that is repeat DNA.

- Carry out a linear model analysis using first GC, second repeat content and third both. Summarise what you have learnt.
- Check whether the assumption of normality in the error is justified by examining the distribution of the residuals and seeing if there are any systematic biases.
- Try using a square-root transformation of the original data. Does this improve the residuals? Does it change your inferences?
- How much of the variation in the original signal have you explained? What is the difference between Multiple R-squared and Adjusted R-squared?

C) Using the function acf(), plot the autocorrelation of the residuals. What do you notice? Is this a problem for the linear model analysis? What does it suggest?

7. Bayesian inference

A) What, if anything, is the Bayesian equivalent of a P value?

B) Consider flipping a drawing-pin. What is the probability that it will land point up? Characterise your prior belief about this probability by sketching a distribution and then find values of the parameters of a beta distribution that make a suitable fit. Now flip a drawing-pin 10 times and record the number of each event.
The posterior distribution for the probability of landing point up is now given by a beta distribution with parameters a + nU and b + n - nU, where a and b are the parameters from your prior, n is the number of trials (here 10) and nU is the number of times the pin landed point up. Use R to draw the prior and posterior. Carry on flipping the pin and look again at the posterior after 20, 30, 40 and 50 flips. Now combine information across the room. What is your point estimate for the probability? What is the 95% credible interval (ETPI)?

C) A Bayes factor can be used to represent the evidence for different models. Given the data you have just collected, measure the evidence that the probability of landing point up is not equal to 0.5 after different numbers of throws. What Bayes factor do you think is necessary to convince you that one model is better than another?
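The beta updating described in 7B, and one possible Bayes factor for 7C, can be sketched as below. The prior parameters `a` and `b` and the counts `n` and `n_up` are made-up values for illustration only. The Bayes factor shown compares a model in which the probability has the Beta(a, b) prior against the point hypothesis that it equals 0.5; other choices of alternative model are equally legitimate.

```r
# Hypothetical prior Beta(a, b) and hypothetical data: n flips, n_up point up.
a <- 2; b <- 2
n <- 10; n_up <- 7

# Conjugate update: the posterior is Beta(a + n_up, b + n - n_up).
a_post <- a + n_up
b_post <- b + n - n_up

# Draw the posterior (solid) and the prior (dashed) on the same axes.
p <- seq(0, 1, length.out = 501)
plot(p, dbeta(p, a_post, b_post), type = "l",
     xlab = "Probability of landing point up", ylab = "Density")
lines(p, dbeta(p, a, b), lty = 2)
legend("topleft", legend = c("Posterior", "Prior"), lty = c(1, 2))

# Point estimate (posterior mean) and equal-tailed 95% interval.
post_mean <- a_post / (a_post + b_post)
etpi <- qbeta(c(0.025, 0.975), a_post, b_post)

# Bayes factor for "p ~ Beta(a, b)" against "p = 0.5". The binomial
# coefficient cancels, leaving B(a_post, b_post) / (B(a, b) * 0.5^n).
bf_10 <- exp(lbeta(a_post, b_post) - lbeta(a, b) - n * log(0.5))

print(post_mean)
print(etpi)
print(bf_10)
```

Re-running the block with the cumulative counts after 20, 30, 40 and 50 flips shows how the posterior and the Bayes factor change as data accumulate.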
