MTH5119 Lab 2: Exact mean and variance of the sampling distribution of ȳ under SRS,
and simulating the sampling distribution to find the coverage when $\binom{N}{n}$ is large
In Lab 1 we found the exact sampling distribution of ȳ (and $s^2$) for various sample sizes under
SRS from the population of percentage assessments of 12 MTH5119 students in 2009.
1. Import popn.dat into C1 of MINITAB as before, naming the column popn.
2. Run srsplus.exe for sample size n = 4 to get the sampling distribution in newstat.dat, and
import the two columns into C2, C3 of MINITAB.
3. Use Calc→Column Statistics on C1 to find the finite population moments
Ȳ = ______ , $S^2$ = ______ .
Note: if you want these exactly, i.e. without rounding error, use MINITAB instead to get the
column sum and divide by 12 to get the mean (using a hand calculator or MINITAB’s
Tools→Windows Calculator). For the exact variance, get the column sum of squares, subtract
the square of the column sum divided by 12, then divide by 11.
4. Thus find Var[ȳ] from the formula in Theorem 2.1.
5. By using Calc→Column Statistics on C2, show that E[ȳ] = Ȳ, which is the statement that
the sample mean is an unbiased estimator of the population mean. Do this also for the
sample variance in C3, i.e. show that E[$s^2$] = $S^2$.
6. If you use Calc→Column Statistics→Standard Deviation on C2, and square the answer,
you will NOT get Var[ȳ], but $\frac{\binom{N}{n}}{\binom{N}{n}-1}$ Var[ȳ]. This is because MINITAB treats the column as
a sample of data, not the equally likely values in a discrete distribution (which is what we
want). To get the correct value for Var[ȳ], calculate using $E[\bar{y}^2] - (E[\bar{y}])^2$ by
MEAN(C2**2)-MEAN(C2)**2
in a vacant column. Verify that the answer agrees exactly (within rounding error) with
your answer in 4.
7. This is of course not a proof of Theorem 2.1, but a demonstration that the theorem holds
for THIS population. In fact it holds for any population and any sample size under SRS
(see proof 2 in lectures).
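The checks in steps 3 to 6 can also be sketched outside MINITAB. The Python below is only an illustration: the 12 population values are made up (they are NOT the contents of popn.dat), and it assumes that Theorem 2.1 is the standard SRS result Var[ȳ] = (1 − n/N) S²/n. Exact fractions are used so that the comparisons involve no rounding error.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Hypothetical population of N = 12 percentage marks (NOT the real popn.dat).
popn = [35, 42, 48, 51, 55, 58, 60, 63, 67, 71, 78, 90]
N, n = len(popn), 4


def exact_moments(values):
    """Mean and variance of equally likely values (divisor M, not M - 1)."""
    vals = [Fraction(v) for v in values]
    m = sum(vals) / len(vals)
    v = sum(x * x for x in vals) / len(vals) - m * m
    return m, v


# Step 3: finite population mean Ybar and variance S^2 (divisor N - 1).
total = sum(popn)
sum_sq = sum(y * y for y in popn)
Ybar = Fraction(total, N)
S2 = (Fraction(sum_sq) - Fraction(total * total, N)) / (N - 1)

# Exact sampling distribution: ybar and s^2 for every one of the C(N, n) samples.
ybars, s2s = [], []
for sample in combinations(popn, n):
    t = sum(sample)
    ss = sum(y * y for y in sample)
    ybars.append(Fraction(t, n))
    s2s.append((Fraction(ss) - Fraction(t * t, n)) / (n - 1))

M = comb(N, n)                                  # number of possible samples
E_ybar, var_ybar = exact_moments(ybars)         # steps 5 and 6
E_s2, _ = exact_moments(s2s)                    # step 5, sample variance
theorem_var = (1 - Fraction(n, N)) * S2 / n     # step 4 (assumed form of Theorem 2.1)
sample_style_var = var_ybar * M / (M - 1)       # what a divisor M - 1 routine reports

print("E[ybar] == Ybar:", E_ybar == Ybar)
print("E[s^2] == S^2:", E_s2 == S2)
print("Var[ybar] matches theorem:", var_ybar == theorem_var)
print("Inflated 'sample' variance:", float(sample_style_var))
```

As in step 7, agreement for one population is a demonstration rather than a proof.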
SIMULATION
Today we are going to run a FORTRAN program which generates a sequence of independent
random samples simulating the exact sampling distribution and hence only approximating the
exact coverage probability for our chosen method. It is not really necessary to simulate when
there are a manageable number of samples in the exact distribution, but for larger populations it
may be necessary because of time or storage limitations. Even the highly skew ‘Wang’ population of N = 36 which we will examine later has 254186856 samples of size 10. We will first look
at the accuracy of our simulation procedure for the example in Lab 1, and also at other important
questions, e.g. is it possible to get coverage closer to 95% for smallish samples by replacing 1.96
in the nominal interval by a percentage point of the t-distribution, as W. G. Cochran suggests?
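The sample counts that motivate simulation can be checked with a binomial coefficient. This short Python sketch is not part of the lab software; the storage figure assumes, purely for illustration, two 8-byte numbers (ȳ and s²) per sample.

```python
from math import comb

n_lab1 = comb(12, 4)       # samples of size 4 from the Lab 1 population
n_wang_5 = comb(36, 5)     # samples of size 5 from a population of 36
n_wang_10 = comb(36, 10)   # samples of size 10 from a population of 36

print(n_lab1)      # 495
print(n_wang_5)    # 376992
print(n_wang_10)   # 254186856

# Rough storage for the full exact distribution at n = 10,
# ASSUMING two 8-byte values per sample (an illustrative assumption).
bytes_needed = n_wang_10 * 2 * 8
print(f"about {bytes_needed / 1e9:.1f} GB")
```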
1. From my personal course web page, save two files listed under "Practicals":
(a) Wang’s population, calling the file wang.dat (NB: examine this file using Notepad,
say, to ensure that it contains only numbers and that the filename is correct without
any extra file extensions like .txt. You SHOULD be able to achieve this smoothly
under Mozilla by File→Save (Web) Page as, clicking on "All Files" and entering the
full filename in the dialog box.)
(b) Simulation Program, calling the file srsfast.exe
2. Calculate the coverage indicator of the usual interval in C4 for each sample by Calc→Calculator
ABSO(C2-MEAN(C1))≤1.96*SQRT((1-4/12)/4*C3),
then the EXACT coverage probability is just the column mean of C4, as in Lab 1.
3. Now repeat the above calculation but with 3.182 (the 97.5 percentage point of Student’s t
on 3 d.o.f.) substituted for 1.96, and report the new coverage probability. Do we now have
overcoverage rather than undercoverage?
4. We will now run the program srsfast.exe which generates independent realisations from
the sampling distribution, each realisation being the sample mean and variance of a sample
chosen at random (with replacement, so independence is maintained) from the $\binom{N}{n}$ possible
samples. It works by using Waterman’s algorithm, which is an efficient method of generating a SRS.
Enter sample size 4, population size 12 as before, and you also need to specify K, the
number of independent SRSs, or the ‘size’ of the simulation. This should be large for
accuracy, but how large? Enter 10000 (ten thousand) and press return, and you will see the count
of the number of samples generated. The 10000 values of ȳ and $s^2$ are given in estims.dat,
which you can read into MINITAB into C5 and C6, naming the columns ybar and ssqd.
5. Calculate the estimated coverage in C7 in the same way as in step 2, substituting C5 for C2,
and C6 for C3 (or double-click on the variable names). How does the estimated coverage
compare with the exact (true) coverage? i.e. how accurate is the estimated coverage? If
we did not know the true coverage, R say, but estimated it instead by r, then a 95% C.I. for R
is
$r \pm 1.96 \sqrt{\frac{r(1-r)}{K}}$,
based on the usual binomial distribution for independent repetitions (note that as K is
very large, always use the normal approximation (1.96) for this formula). Calculate this
confidence interval for R using a hand calculator or the Microsoft one in MINITAB (Tools).
Does it contain the exact (true) value? Is it too wide to give a good idea of R?
6. Repeat step 5 using the t-distribution (3.182) rather than the 1.96 (based on the normal
approximation) in the calculation of C7.
7. Now try a real problem to examine the coverage for intervals based on the ‘Wang’ population.
In your directory, rename popn.dat to MTH511909.dat, and then make a copy of
wang.dat called popn.dat, so that it can be used by the program.
8. Read the new population (of size 36) into C1 in a new worksheet (save the old worksheet
if you like for your coursework but PLEASE do not print it!) and calculate the mean,
observing from, say, a histogram that the population is very skew.
9. Try to estimate by simulation the coverage probabilities of nominal 95% C.I.s based on
the normal and t-distributions for samples of sizes 5, 10 and 15 (say). You will need the
97.5% points of the t-distribution on 4, 9 and 14 d.o.f., which are 2.776, 2.262 and 2.145
respectively. Use 10000 simulations and give a 95% C.I. for the coverage. (The case of 5
can be solved exactly as there are only 376992 possible samples.) Does the suggestion by
Cochran work well or is the coverage poor even with the t-distribution? Why do you think
this is so?
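The simulation steps above can be mimicked in a few lines of Python. This is a rough sketch only, not the srsfast.exe program: it draws each SRS with the standard library's random.sample rather than Waterman's algorithm, and the skew population of 36 values is invented for illustration (it is NOT Wang's data). The function name simulate_coverage and the seed are likewise hypothetical choices.

```python
import random
from math import sqrt


def simulate_coverage(popn, n, multiplier, K=10000, seed=1):
    """Estimate coverage of ybar +/- multiplier * sqrt((1 - n/N) * s^2 / n).

    Returns the estimated coverage r and the 95% C.I.
    r +/- 1.96 * sqrt(r * (1 - r) / K) for the true coverage R.
    """
    rng = random.Random(seed)
    N = len(popn)
    Ybar = sum(popn) / N
    hits = 0
    for _ in range(K):
        sample = rng.sample(popn, n)                  # one SRS, without replacement
        ybar = sum(sample) / n
        s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)
        half_width = multiplier * sqrt((1 - n / N) * s2 / n)
        hits += abs(ybar - Ybar) <= half_width        # coverage indicator (0 or 1)
    r = hits / K
    se = sqrt(r * (1 - r) / K)
    return r, (r - 1.96 * se, r + 1.96 * se)


# Hypothetical highly skew population of N = 36 (NOT Wang's population).
skew_popn = [1] * 20 + [2] * 8 + [5] * 4 + [20, 30, 50, 100]

results = {}
for n, t_point in [(5, 2.776), (10, 2.262), (15, 2.145)]:
    for label, mult in [("normal", 1.96), ("t", t_point)]:
        # Same seed for both multipliers, so both use the same K samples.
        results[(n, label)] = simulate_coverage(skew_popn, n, mult)
        r, (lo, hi) = results[(n, label)]
        print(f"n={n:2d} {label:6s} coverage={r:.4f}  95% C.I. ({lo:.4f}, {hi:.4f})")
```

Because both multipliers are applied to the same simulated samples, the t-based interval can never cover less often than the normal-based one in this sketch; whether either gets close to 95% depends on the population.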