Robin Beaumont [email protected]
Options for demonstrating sampling variability and sampling
distributions in teaching statistics
Tuesday, 11 October 2011
Contents
Sampling in SPSS and R
1 Using SPSS
1.1 Using SPSS syntax
1.1.1 One sample
1.1.2 Multiple samples all the same size and from same distribution
1.1.3 Samples of different sizes
1.1.4 Sampling distributions
2 Online Apps
3 The standard error of the Mean
3.1.1 Effect of sample size upon SEM - formula appreciation
4 Using SPSS script
4.1.1 Alternative script - Distribution.sbs
5 In R
6 Online presentations and other tools
Sampling in SPSS and R
The aim of this handout is to describe the various options available for teaching the concept of sampling
variability, along with some student material.
The process usually involves creating samples and then comparing them both with the parent population and
amongst themselves (SEM demonstration).
I offer four ways of doing this below: SPSS (two methods), online apps and R.
1 Using SPSS
1.1 Using SPSS syntax
The traditional way of investigating random samples in SPSS is to use the SPSS syntax window:
1.1.1 One sample
Simple example to create a single sample with 10,000 cases from a Normal distribution with mean = 100; SD = 15:
SPSS syntax
* Example of creating a random sample.
* Create 10,000 cases for the sample.
NEW FILE.
INPUT PROGRAM.
LOOP #i = 1 TO 10000.
COMPUTE X = RV.NORMAL(100,15).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
Use the Analyze menu to get the results, including a boxplot.
The next exercise is to produce several samples.
1.1.2 Multiple samples all the same size and from same distribution.
Variables called V20 to V30, all the same size. I have assumed that you have run the above syntax first; if not, you
need to use the second version below.
If you have run the above script:
NUMERIC V20 to V30.
VECTOR v = V20 to V30.
* loop for sample size.
LOOP #case = 1 TO 100.
* loop for each sample.
LOOP #i = 1 TO 11.
* now we have to specify both column (sample) and row (sample number).
COMPUTE v(#i) = RV.NORMAL(100,15).
END LOOP.
END LOOP.
EXECUTE.
If you have not run the above script:
NEW FILE.
INPUT PROGRAM.
NUMERIC V20 to V30.
VECTOR v = V20 to V30.
* loop for sample size.
LOOP #case = 1 TO 100.
* loop for each sample.
LOOP #i = 1 TO 11.
* now we have to specify both column (sample) and row (sample number).
COMPUTE v(#i) = RV.NORMAL(100,15).
END LOOP.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
Typical output:
Descriptive Statistics
        N      Mean        Std. Deviation
V20     100    101.5421    14.20531
V21     100    101.0039    15.53362
V22     100     99.1124    14.14247
V23     100     97.6240    14.07071
V24     100     99.9382    14.43248
V25     100    100.1818    13.80487
V26     100    100.4502    15.45697
V27     100    101.6055    15.04477
V28     100    100.8888    14.05551
V29     100    101.6523    14.24829
V30     100     99.9043    14.19884
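For readers working in R rather than SPSS, the same exercise can be sketched as follows. This is only an illustrative equivalent, not part of the SPSS material: the object names (samples, descriptives) and the seed are my own choices, and the numbers it prints will differ from the SPSS output above because the samples are random.

```r
set.seed(2011)  # arbitrary seed, used only so the sketch is repeatable

n_per_sample <- 100   # rows: cases in each sample
n_samples    <- 11    # columns: V20 ... V30

# Fill a 100 x 11 matrix with Normal(100, 15) values, one column per sample
samples <- matrix(rnorm(n_per_sample * n_samples, mean = 100, sd = 15),
                  nrow = n_per_sample, ncol = n_samples,
                  dimnames = list(NULL, paste0("V", 20:30)))

# One row per sample: N, mean and standard deviation
descriptives <- data.frame(
  N    = colSums(!is.na(samples)),
  Mean = colMeans(samples),
  SD   = apply(samples, 2, sd)
)
print(round(descriptives, 4))
```

Storing the samples as columns of a matrix mirrors the V20 to V30 layout, so the column summaries line up with the rows of the Descriptive Statistics table.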
1.1.3 Samples of different sizes
There are two main ways to do this: you can create all the samples in a single variable and add a grouping variable, or
alternatively create several variables with a different sample size in each. For various reasons the former strategy is
best; however, just for interest, I have included below the latter option of putting the various samples of different
sizes in separate variables:
NEW FILE.
INPUT PROGRAM.
LOOP #count = 1 TO 500.
DO IF (#count <31).
COMPUTE samp30 = RV.NORMAL(100,15).
END IF.
DO IF (#count <51).
COMPUTE samp50 = RV.NORMAL(100,15).
END IF.
DO IF ( #count <101).
COMPUTE samp100 = RV.NORMAL(100,15).
END IF.
COMPUTE samp500 = RV.NORMAL(100,15).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
This approach (i.e. a separate variable for each sample)
causes problems when analysing the data, as SPSS
considers the smaller samples to have missing values!
Therefore the better solution is to use a grouping
variable, that is, an identifier indicating the sample each
observation (case) belongs to.
The next SPSS syntax script duplicates the above but creates just two variables (one called GROUP, the other
VALUE):
new file.
input program.
loop #i=1 to 30.
compute group=1.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 50.
compute group=2.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 100.
compute group=3.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 500.
compute group=4.
compute value=rv.Normal(100,15).
end case.
end loop.
end file.
end input program.
execute .
The code above is not the most elegant; you could
use one loop with a number of 'DO IF' statements:
new file.
input program.
loop #i=1 to 500.
DO IF (#i<31).
compute group=1.
compute value=rv.Normal(100,15).
end case.
END IF.
DO IF (#i<51).
compute group=2.
compute value=rv.Normal(100,15).
end case.
END IF.
DO IF (#i<101).
compute group=3.
compute value=rv.Normal(100,15).
end case.
END IF.
compute group=4.
compute value=rv.Normal(100,15).
end case.
end loop.
end file.
end input program.
SORT CASES by group(a).
execute .
Both the above SPSS syntax files do the same thing, that is, produce four samples of different sizes from a normal
distribution with mean 100 and SD = 15.
You could easily change the parameters of the distribution or even change the distribution itself. Two
alternatives are:
the uniform: rv.Uniform(lower, upper) or
exponential: rv.exp(mean)
Using the Explore command in SPSS shows the SD for each group and also a box plot.
Having carried out the above tasks, it is then possible to
complete the following table.
Sample size                     Minimum value    Mean    Maximum value    Standard deviation
30
50
100
500
Theoretical population value
The above exercise will demonstrate:
The standard deviation varies little with sample size - there must be a sample adjustment factor in it!
The mean also varies little from the population mean of 100 (repeated sampling for smaller samples
produces wider variation - next exercise).
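The table can also be filled in from R. The following is a minimal sketch, assuming the same Normal(100, 15) population and the same grouping-variable layout as the GROUP/VALUE script; the names dat and summary_table and the seed are hypothetical choices of mine, not part of the SPSS exercise.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

sizes <- c(30, 50, 100, 500)

# Long format: one row per observation, with a grouping variable
# identifying the sample it belongs to (as in the GROUP/VALUE script)
dat <- data.frame(
  group = rep(seq_along(sizes), times = sizes),
  value = rnorm(sum(sizes), mean = 100, sd = 15)
)

# One row of the table per sample: size, minimum, mean, maximum, SD
summary_table <- do.call(rbind, lapply(split(dat$value, dat$group), function(v) {
  data.frame(n = length(v), minimum = min(v), mean = mean(v),
             maximum = max(v), sd = sd(v))
}))
print(round(summary_table, 2))
```

Changing sizes to c(3, 10, 20, 30) reruns the same summary for smaller samples.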
The above exercise can then be repeated, changing the sample sizes to 3, 10, 20 and 30:
new file.
input program.
loop #i=1 to 3.
compute group=1.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 10.
compute group=2.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 20.
compute group=3.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 30.
compute group=4.
compute value=rv.Normal(100,15).
end case.
end loop.
end file.
end input program.
execute .
Given that these are random samples, each person will obtain a different result; however, what they should notice is
that the means (medians in the boxplot) vary less as the sample size gets larger. You could ask them to
repeatedly create multiple random samples of varying size and then plot the means (technically what we would
produce is a sampling distribution of the mean), but at this stage it is probably better to revert to online
simulations (see below).
1.1.4 Sampling distributions
Typical student explanation:
So far we have looked at the characteristics of one or more samples from a population, but what about the
characteristics across samples? Why, you may well ask, would we bother with such additional complexity? Just
consider this:
I have a valuable substance (Guinness) and want to take as small a sample as possible to find an accurate mean
value of substance X.
So how small a sample can we take and still obtain an accurate mean value?
To answer this question we need to assess the variation of means across samples of a specific size.
While we have done this for a small number of samples, we will now consider many samples to produce a
distribution.
2 Online Apps
Go to http://onlinestatbook.com/stat_sim/sampling_dist/index.html
Using the app at this website we can ask for repeated samples of different sizes and then plot their means. I have
done it for 10,000 samples of size 5 and also of size 25.
Students should notice how much more spread out
the means are for the smaller samples.
3 The standard error of the Mean
Student explanation:
The Standard Error of the Mean provides a measure of
the standard deviation of sample means. In other
words it is just another standard deviation but now we
are at the between sample level rather than within
sample level. Because we are working at a different
level the name has changed for the same idea
concerning spread. From the above exercise, we have
both the population data along with information about
a set of samples from it. Interestingly all we need to
calculate the SEM is information from a single sample.
We will now compare the observed answer (for the
samples in the above screen shot = 2.23 for samples of size 5) with a specific formula. This formula is known as
the SEM (Standard Error of the Mean).
\[
\sigma_{\bar{x}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} = \frac{\text{standard deviation of sample}}{\text{square root of number in sample}} = \frac{5}{\sqrt{5}} = 2.236
\]
and for the sample size of 25, SEM = 5/√25 = 1
We can see from the above formula that the Standard Error of the Mean is equal to the standard deviation
divided by the square root of the sample size. We have samples of size 5 and 25, so we can calculate the SEM
for each one. You will notice that the observed SD of the sample means agrees with the value from the formula (2.23 observed against 2.236 for samples of size 5). This is truly amazing: we can predict the distribution of means of random samples without carrying out the
sampling, just by using the SEM formula.
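The comparison between the observed SD of the sample means and the SEM formula can also be checked by simulation. The sketch below is only an illustration: it assumes a Normal population with SD = 5 (the value used in the worked formula), while the population mean of 16, the seed and the function name observed_sem are hypothetical choices of mine.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

pop_mean <- 16     # hypothetical population mean (not given in the handout)
pop_sd   <- 5      # population SD used in the worked formula
n_reps   <- 10000  # number of samples drawn at each sample size

# Observed SD of the means of n_reps random samples of size n
observed_sem <- function(n) {
  means <- replicate(n_reps, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))
  sd(means)
}

sample_sizes <- c(5, 25)
comparison <- data.frame(
  n         = sample_sizes,
  observed  = sapply(sample_sizes, observed_sem),
  predicted = pop_sd / sqrt(sample_sizes)   # SEM formula: sigma / sqrt(n)
)
print(comparison)
```

The observed column will change slightly from run to run (or with a different seed), whereas the predicted column is fixed by the formula.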
3.1.1 Effect of sample size upon SEM - formula appreciation
We know that the formula for the standard error of the mean (SEM) is:
\[
\sigma_{\bar{x}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} = \frac{\text{standard deviation of sample}}{\text{square root of number in sample}}
\]
Let's consider what happens to the SEM as the sample size increases. In the above equation the top value
(numerator) remains constant, but the bottom value (denominator) increases. What happens in this
instance, which is a property of all fractions, is that the total value decreases; therefore as sample size increases
the variability of the sample means decreases. You can think of it in terms of accuracy: the larger the random
sample, the more accurate the sample mean is as an estimate of the population mean. A statistician would say that this indicates that the sample mean is a consistent estimator.
As N increases -> SEM decreases
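To make the effect of the denominator concrete, the formula can be evaluated for a range of sample sizes. This is a small sketch, assuming the SD of 15 used for the Normal(100, 15) population earlier in the handout; the names sem_table and sem_at are my own.

```r
pop_sd <- 15                           # SD of the Normal(100, 15) population
n      <- c(3, 10, 20, 30, 50, 100, 500)

# SEM = sigma / sqrt(n) for each sample size
sem_table <- data.frame(n = n, sem = pop_sd / sqrt(n))
print(round(sem_table, 3))

# Because n sits under a square root, quadrupling n only halves the SEM
sem_at <- function(n) pop_sd / sqrt(n)
print(sem_at(25) / sem_at(100))  # ratio of the SEM at n = 25 to the SEM at n = 100
```

The square root is why gains in precision tail off: each further halving of the SEM needs four times as many observations.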
To learn more about SPSS syntax see the excellent tutorial including datasets and videos at:
http://www.ats.ucla.edu/stat/spss/seminars/spss_syntax/default.htm
4 Using SPSS script
SPSS scripts allow users to create additional dialog boxes, and several people have produced scripts which provide
dialog boxes for creating random samples. This is probably an easier
alternative to learning SPSS syntax.
http://www.spsstools.net/SampleScripts.htm provides three possible
scripts.
Right-click on the "Generate Random variables EN SBS" link and select the "Save Link as" option to save the
script file to your local drive, changing the default extension from txt to sbs.
Back in SPSS:
This allows you to create multiple samples of a specific size. You can also run the
script several times to create many samples by un-checking the "Replace the
working data file" option.
4.1.1 Alternative script - Distribution.sbs
You will then be presented with:
Type in the sample size you want:
Step 1 - click Next to allow you to select:
Step 2 - the distribution; I selected Normal.
Step 3 - you can change the mean and SD.
Once you have created one sample you can create up
to 20 different ones, each time clicking Next.
To finish, click the Finish button!
Typical results using the menu option Explore:
Case Processing Summary
                               Cases
               Valid             Missing           Total
       group   N     Percent     N     Percent     N     Percent
value  1.00    30    100.0%      0     .0%         30    100.0%
       2.00    20    100.0%      0     .0%         20    100.0%
       3.00    15    100.0%      0     .0%         15    100.0%
       4.00    10    100.0%      0     .0%         10    100.0%
5 In R
R is not for the lazy, but it is amazingly versatile. This section is included for completeness.
# this is a comment
# create an empty plot: x axis = 0 to 62, y axis = 50 to 150
# and give the axes labels
plot(c(0,62), c(50,150), type="n", xlab="Sample size", ylab="mean")
# sample sizes 3 to 61 in steps of 2 (=df)
for (df in seq(3,61,2))
{
  # number of samples (=60) at each size
  for (i in 1:60)
  {
    # create a random sample of size df from a normal distribution
    # and store it in the vector x
    x <- rnorm(df, mean=100, sd=15)
    # plot the mean of this sample against its size
    points(df, mean(x))
  } # end for each group of samples
} # end for each sample size
You can see an animated version of the above at: http://animation.yihui.name/prob:law_of_large_numbers. This
site has a large number of animations, all written in R code using the free R animation package. To the casual
visitor all the R code is hidden away; they just see the beautiful animations.
With more R knowledge one can create more complex examples. The following is taken from Maindonald & Braun,
3rd ed., 2010, pp. 89-90. It simulates 1,000 samples at each of three sample sizes from a skewed
distribution. The code below can be used as the basis for a large number of similar exercises.
############################### from Maindonald & Braun p.89-90, CUP 2010
## uses the lattice library
library(lattice)
##############
# function to generate n sample values
sampvals <- function(n) exp(rnorm(n, mean = 0.5, sd = 0.3))
## Means across rows of a dimension nsamp x sampsize matrix of
## sample values gives nsamp means of samples of size sampsize.
samplingDist <- function(sampsize = 3, nsamp = 1000, FUN = mean)
  apply(matrix(sampvals(sampsize * nsamp), ncol = sampsize), 1, FUN)
size <- c(3, 9, 30)
## Simulate means of samples of 3, 9 and 30; place in dataframe
df <- data.frame(y3  = samplingDist(sampsize = size[1]),
                 y9  = samplingDist(sampsize = size[2]),
                 y30 = samplingDist(sampsize = size[3]))
###############
## use strip.custom to customise the strip labelling
doStrip <- strip.custom(strip.names = TRUE,
                        factor.levels = as.expression(size),
                        var.name = "sample size", sep = expression(" = "))
## Then include the argument 'strip=doStrip' in the call to densityplot
###############
## Simulate source population (sampsize = 1)
y <- samplingDist(sampsize = 1)
densityplot(~y3+y9+y30, data = df, outer = TRUE, layout = c(3,1),
            plot.points = FALSE, panel = function(x, ...) {
              panel.densityplot(x, ..., col = "black")
              panel.densityplot(y, col = "gray40", lty = 2, ...)
            }, strip = doStrip)
6 Online presentations and other tools
The New Zealand CensusAtSchool website http://www.censusatschool.org.nz/resources/statistical-investigation/
contains a section on informal inference, called "The eyes have it"
http://www.censusatschool.org.nz/2009/informal-inference/ which contains animated gifs that people can use in
their presentations, and also an excellent presentation concerning sampling variability and how this can
informally relate to hypothesis testing; see:
http://www.stat.auckland.ac.nz/~wild/09.USCOTSTalk.html
http://www.censusatschool.org.nz/2009/informal-inference/Maxine.Combined.Use.html