MTH5119 Lab 2: Exact mean and variance of the sampling distribution of ȳ under SRS, and simulating the sampling distribution to find the coverage when $\binom{N}{n}$ is large

In Lab 1 we found the exact sampling distribution of ȳ (and s²) for various sample sizes under SRS from the population of percentage assessments of 12 MTH5119 students in 2009.

1. Import popn.dat into C1 of MINITAB as before, naming the column popn.

2. Run srsplus.exe for sample size n = 4 to get the sampling distribution in newstat.dat, and import the two columns into C2, C3 of MINITAB.

3. Use Calc→Column Statistics on C1 to find the finite population moments Ȳ = ______ and S² = ______. Note that if you want these exactly, i.e. without rounding error, use MINITAB instead to get the column sum and divide by 12 to get the mean (using a hand calculator or MINITAB's Tools→Windows Calculator). For the exact variance, get the column sum of squares, subtract the square of the column sum divided by 12, then divide by 11.

4. Thus find Var[ȳ] from the formula in Theorem 2.1.

5. By using Calc→Column Statistics on C2, show that E[ȳ] = Ȳ, which is the statement that the sample mean is an unbiased estimator of the population mean. Do this also for the sample variance in C3.

6. If you use Calc→Column Statistics→Standard Deviation on C2 and square the answer, you will NOT get Var[ȳ], but $\frac{\binom{N}{n}}{\binom{N}{n} - 1}\,\mathrm{Var}[\bar{y}]$. This is because MINITAB treats the column as a sample of data, not as the equally likely values of a discrete distribution (which is what we want). To get the correct value for Var[ȳ], calculate E[ȳ²] − (E[ȳ])² by MEAN(C2**2)-MEAN(C2)**2 in a vacant column. Verify that the answer agrees (within rounding error) with your answer in 4.

7. This is of course not a proof of Theorem 2.1, but a demonstration that the theorem holds for THIS population. In fact it holds for any population and any sample size under SRS (see proof 2 in lectures).
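The checks in steps 3 to 6 can also be reproduced outside MINITAB. The following is a minimal Python sketch, not part of the lab software. It enumerates every possible sample of size 4 and uses exact fractions, so there is no rounding error. The population values are invented for illustration and are NOT the contents of popn.dat, and the function names are hypothetical. Theorem 2.1 is assumed here to be the standard SRS result Var[ȳ] = (1 − n/N) S²/n, which matches the interval formula used in the MINITAB calculator expression later in this lab.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Hypothetical population of N = 12 percentage marks.
# These values are made up; they are NOT the contents of popn.dat.
population = [35, 42, 48, 51, 55, 58, 60, 63, 67, 72, 78, 91]


def mean_and_variance(values):
    """Exact mean and variance (divisor len - 1) from the sum and sum of squares."""
    k = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    mean = Fraction(total, k)
    var = (total_sq - Fraction(total * total, k)) / (k - 1)
    return mean, var


def exact_sampling_distribution(values, n):
    """(ybar, s2) for every one of the C(N, n) equally likely SRS samples."""
    return [mean_and_variance(sample) for sample in combinations(values, n)]


N, n = len(population), 4
Ybar, S2 = mean_and_variance(population)      # step 3
theorem_var = (1 - Fraction(n, N)) * S2 / n   # step 4 (assumed form of Theorem 2.1)

dist = exact_sampling_distribution(population, n)
M = len(dist)                                 # number of samples, comb(N, n)

E_ybar = sum(m for m, _ in dist) / M          # step 5: equals Ybar
E_s2 = sum(v for _, v in dist) / M            # step 5: equals S2
var_ybar = sum(m * m for m, _ in dist) / M - E_ybar ** 2   # step 6: E[ybar^2] - (E[ybar])^2

# What a "sample variance" routine returns when applied to the column of ybar values:
sample_style_var = sum((m - E_ybar) ** 2 for m, _ in dist) / (M - 1)

print("number of samples:", M, "=", comb(N, n))
print("E[ybar] == Ybar:", E_ybar == Ybar)
print("E[s2] == S2:", E_s2 == S2)
print("Var[ybar] == (1 - n/N) S2 / n:", var_ybar == theorem_var)
print("inflation factor:", sample_style_var / var_ybar)
```

Because every quantity is a Fraction, the comparisons are exact equalities rather than agreement within rounding error. The last line prints the ratio by which a sample-variance routine overstates Var[ȳ].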
SIMULATION

Today we are going to run a FORTRAN program which generates a sequence of independent random samples, simulating the exact sampling distribution and hence only approximating the exact coverage probability for our chosen method. It is not really necessary to simulate when there are a manageable number of samples in the exact distribution, but for larger populations it may be necessary because of time or storage limitations. Even the highly skew 'Wang' population of N = 36, which we will examine later, has 254186856 samples of size 10. We will first look at the accuracy of our simulation procedure for the example in Lab 1, and also at other important questions, e.g. is it possible to get coverage closer to 95% for smallish samples by replacing 1.96 in the nominal interval by a percentage point of the t-distribution, as W. G. Cochran suggests?

1. From my personal course web page, save two files listed under "Practicals":
(a) Wang's population, calling the file wang.dat (NB: examine this file using, say, Notepad to ensure that it contains only numbers and that the filename is correct, without any extra file extension like .txt. You SHOULD be able to achieve this smoothly under Mozilla by File→Save (Web) Page as, clicking on "All Files" and entering the full filename in the dialog box.)
(b) Simulation Program, calling the file srsfast.exe

2. Calculate the coverage indicator of the usual interval in C4 for each sample by Calc→Calculator ABSO(C2-MEAN(C1))≤1.96*SQRT((1-4/12)/4*C3); then the EXACT coverage probability is just the column mean of C4, as in Lab 1.

3. Now repeat the above calculation but with 3.182 (the 97.5% point of Student's t on 3 d.o.f.) substituted for 1.96, and report the new coverage probability. Do we now have overcoverage rather than undercoverage?

4. We will now run the program srsfast.exe, which generates independent realisations from the sampling distribution, each realisation being the sample mean and variance of a sample chosen at random (with replacement, so independence is maintained) from the $\binom{N}{n}$ possible samples. It works by using Waterman's algorithm, which is an efficient method of generating a SRS. Enter sample size 4 and population size 12 as before; you also need to specify K, the number of independent SRSs, or the 'size' of the simulation. This should be large for accuracy, but how large? Enter 10000 (ten thousand), then press return, and you will see the count of the number of samples generated. The 10000 values of ȳ and s² are given in estims.dat, which you can read into MINITAB into C5 and C6, naming the columns ybar and ssqd.

5. Calculate the estimated coverage in C7 in the same way as in step 2, substituting C5 for C2 and C6 for C3 (or double-click on the variable names). How does the estimated coverage compare with the exact (true) coverage, i.e. how accurate is the estimated coverage? If we did not know the true coverage, R say, but estimated it instead by r, then a 95% C.I. for R is

$r \pm 1.96\sqrt{\dfrac{r(1-r)}{K}}$,

based on the usual binomial distribution for independent repetitions (note that as K is very large, always use the normal approximation (1.96) for this formula). Calculate this confidence interval for R using a hand calculator or the Microsoft one in MINITAB (Tools). Does it contain the exact (true) value? Is it too wide to give a good idea of R?

6. Repeat step 5 using the t-distribution (3.182) rather than the 1.96 (based on the normal approximation) in the calculation of C7.

7. Now try a real problem to examine the coverage for intervals based on the 'Wang' population. In your directory, rename popn.dat to MTH511909.dat, and then make a copy of wang.dat called popn.dat, so that it can be used by the program.

8. Read the new population (of size 36) into C1 in a new worksheet (save the old worksheet if you like for your coursework but PLEASE do not print it!) and calculate the mean, observing from, say, a histogram that it is very skew.

9. Try to estimate by simulation the coverage probabilities of nominal 95% C.I.s based on the normal and t-distributions for samples of sizes 5, 10, 15 (say). You will need the 97.5% points of the t-distribution on 4, 9 and 14 d.o.f., which are 2.776, 2.262 and 2.145 respectively. Use 10000 simulations and give a 95% C.I. for the coverage. (The case of 5 can be solved exactly, as there are only 376992 possible samples.) Does the suggestion by Cochran work well, or is the coverage poor even with the t-distribution? Why do you think this is so?
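The simulation and the binomial confidence interval for the coverage can be sketched in Python as follows. This is an illustrative stand-in for srsfast.exe, not a reimplementation of it. It draws each SRS with the standard library's random.sample rather than Waterman's algorithm. The skew population below is generated artificially and is NOT the contents of wang.dat, and the function names, the seed argument and the default K are assumptions made for the example.

```python
import math
import random

# Hypothetical, artificially generated skew population of N = 36 values.
# It is a stand-in only; it is NOT the 'Wang' population in wang.dat.
population = [round(math.exp(0.15 * i), 2) for i in range(36)]


def covers(sample, N, pop_mean, multiplier):
    """Coverage indicator: |ybar - Ybar| <= multiplier * sqrt((1 - n/N) / n * s2)."""
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((v - ybar) ** 2 for v in sample) / (n - 1)
    return abs(ybar - pop_mean) <= multiplier * math.sqrt((1 - n / N) / n * s2)


def estimate_coverage(values, n, multiplier, K=10000, seed=0):
    """Estimate coverage r from K independent SRSs, with a 95% C.I. for the true R."""
    rng = random.Random(seed)
    N = len(values)
    pop_mean = sum(values) / N
    # random.sample draws n distinct units, i.e. a SRS without replacement.
    hits = sum(covers(rng.sample(values, n), N, pop_mean, multiplier) for _ in range(K))
    r = hits / K
    half_width = 1.96 * math.sqrt(r * (1 - r) / K)
    return r, (r - half_width, r + half_width)


# Sample sizes with the 97.5% points of t on n - 1 d.o.f. quoted in step 9.
for n, t_point in [(5, 2.776), (10, 2.262), (15, 2.145)]:
    r_norm, ci_norm = estimate_coverage(population, n, 1.96)
    r_t, ci_t = estimate_coverage(population, n, t_point)
    print(f"n={n:2d} of {math.comb(len(population), n)} possible samples")
    print(f"  normal (1.96):  r={r_norm:.4f}  C.I.=({ci_norm[0]:.4f}, {ci_norm[1]:.4f})")
    print(f"  t ({t_point}):  r={r_t:.4f}  C.I.=({ci_t[0]:.4f}, {ci_t[1]:.4f})")
```

Using the same seed for the normal and t intervals means both are evaluated on the same K samples, so the t-based coverage estimate can never be smaller than the normal-based one. The figures printed for this artificial population say nothing about the coverage for the actual Wang population.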