NESUG 16
Coders' Corner
CC001
Drawing Multiple Random Samples with Replacement
Charles A. DePascale, Center for Assessment
ABSTRACT
Using SAS, there are several efficient methods to draw random samples with and without replacement. This paper presents a method for drawing multiple random samples with replacement from a data file in a single pass through the data.
The approach is based on an application of the binomial distribution when sampling with replacement and uses the PROBBNML function to estimate the probability that a record will be selected a certain number of times.

INTRODUCTION
The purpose of this paper is to provide an alternative technique for drawing random samples with replacement from a single data set or population.
Applying the basic technique to draw a single sample requires only the use of Base SAS and an understanding of PROC FORMAT, the use of the PUT and INPUT functions, and a random number generator. Extending the technique to draw multiple samples requires only the use of an array and a DO loop.

SAMPLING WITH REPLACEMENT
There are two commonly used methods to draw random samples with replacement, which can be referred to as the 'Pointer' and 'Merge' methods. The Pointer method involves the use of pointers to a primary data set based on the generation of random numbers. This technique is described in a paper presented by Kimball Lewis at a quarterly BASUG meeting (Lewis, 2002). The Merge method involves merging a data set containing randomly generated record numbers with a primary data set. Both methods are discussed and described in detail on SAS-L.
Conducting simulations using the bootstrap method, I have used both the Pointer and Merge methods to draw thousands of samples ranging in size from 50,000 to 80,000 records. As a part of a set of checks on the 'randomness' of the random number generator used to draw the samples, I included a frequency distribution of the number of times individual cases were drawn from the original sample. Reviewing the frequency distribution, it quickly became obvious that the number of times that individual records were selected from the original sample followed a set pattern. Upon closer inspection (combined with the review of some basic statistics texts), it was clear that the number of times that individual records were drawn from the original data set was distributed binomially.

APPLYING THE BINOMIAL DISTRIBUTION
The binomial distribution can be used to predict the probability of occurrence of a particular event when the following assumptions are met:
1. there are two possible outcomes to each event
2. the outcomes are mutually exclusive
3. the events are independent
4. the probability of the occurrence is constant across events.
When those assumptions are met, it is possible to predict the probability of particular outcomes occurring for a given set of events.
In the case of drawing random samples with replacement, all of the assumptions are met. As each record is drawn for the sample, a record is either selected or not selected each time and the probability that a particular record will be selected is always 1 divided by the number of records in the original sample.
Therefore, based on the binomial distribution, it is possible to determine, when drawing a sample with replacement from a data set of 50,000 records, the probability that a record will be drawn 0 times, 1 time, 2 times, etc. Using those probabilities, it is possible to draw a random sample with replacement by applying the selection probabilities to each record in the original data set.

DETERMINING CASE WEIGHTS
The first step in the process is to determine the selection probabilities. The PROBBNML function computes the probability that an observation from a binomial (n,p) distribution will be less than or equal to m. The syntax of the function is PROBBNML(p,n,m) where, in this case,
n is the number of records in the original group and sample,
p, the probability of selection for each record, is 1/n, and
m is the number of times that a particular record will be selected in a sample drawn with replacement.
Using the function, it is possible to determine the cumulative probability that a record will be selected 0, 1, 2, 3, 4, ..., x or fewer times. The probability of exactly m selections is the difference between the values returned for m and m - 1.
In practice, the probabilities change very little as sample sizes increase from 10 to 100 to 1000 to 100,000. For samples of 1000 or more records, the probabilities are identical to three decimal places. Depending on the level of precision required, probabilities could be determined for a particular sample size or an approximation could be applied. The probabilities presented in the following section are based on a sample size of 50,000 records.

ASSIGNING CASE WEIGHTS
The output from the PROBBNML function is used to assign case weights to each record in the original data file. This is accomplished using PROC FORMAT in conjunction with a random number and an assignment statement.
proc format;
value wghtx
0 - .36787576 = 0
.36787576 - .73575888 = 1
.73575888 - .91970044 = 2
.91970044 - .98101307 = 3
.98101307 - .99634061 = 4
.99634061 - .99940594 = 5
.99940594 - .99991678 = 6
.99991678 - .99998976 = 7
.99998976 - .99999888 = 8
.99999888 - .99999989 = 9
.99999989 - 1 = 10;
run;
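The cutoffs used in the wghtx format can be regenerated with PROBBNML. The following is a minimal sketch, assuming n = 50,000 as in this paper; the data set name cutoffs and the variables cum_prob, exact_prob, and prev are illustrative names, not part of the original program.

```sas
/* Sketch: list the cumulative binomial probabilities behind the wghtx cutoffs. */
data cutoffs;
  n = 50000;                 /* records in the original data set (assumed)     */
  p = 1 / n;                 /* probability of selecting a given record per draw */
  prev = 0;
  do m = 0 to 10;
    cum_prob   = probbnml(p, n, m);  /* P(record selected m or fewer times)    */
    exact_prob = cum_prob - prev;    /* P(record selected exactly m times)     */
    prev = cum_prob;
    output;
  end;
  keep m cum_prob exact_prob;
  format cum_prob exact_prob 10.8;
run;

proc print data=cutoffs noobs;
run;
```

Each cum_prob value is the upper bound of the range that maps to the weight m. With n = 50,000 the first two values should agree with the first two cutoffs in the format (.36787576 and .73575888) to the precision shown; changing n in the sketch gives cutoffs for a different sample size.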
As each record is read, a sample case weight is assigned using
the following statement:
samp_weight=input(put(uniform(0),wghtx.),2.);
In the assignment statement, uniform(0) generates a random number between 0 and 1 from a uniform distribution, the PUT function converts the random number to a character value from 0 to 10 using the wghtx format, and the INPUT function converts the character value created by the PUT function to a numeric variable.
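The single-sample step can be sketched as a short DATA step. This is only an illustration under stated assumptions: the wghtx format has already been defined, and the input data set name work.original, the output data set name sample1, and the loop variable copy are hypothetical names that do not appear in the original program.

```sas
/* Sketch: draw one sample with replacement in a single pass. */
data sample1;
  set work.original;                      /* hypothetical source data set      */
  /* a seed of 0 uses the system clock, so each run gives a different sample   */
  samp_weight = input(put(uniform(0), wghtx.), 2.);
  /* write the record once for each time it was "drawn";                       */
  /* a weight of 0 means the loop body never runs and the record is omitted    */
  do copy = 1 to samp_weight;
    output;
  end;
  drop copy;
run;
```

Writing out repeated records is one possible design; keeping a single copy of each record and carrying samp_weight forward as a frequency count is an alternative that avoids enlarging the file.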
CONCLUSIONS
The technique described for drawing random samples with replacement described in this paper provides an alternative approach that is adaptable to many situations and requires relatively
little programming. Without the use of macros, it allows the users
to draw multiple samples in a single pass through the data.
The case weight can then be used to compute statistics or to
control the output records, if necessary, for the random sample.
Running a PROC FREQ on the variable samp_weight will enable
you to confirm that the distribution of records selected is as expected.
One potential disadvantage with the technique, as written, is that
the user gives up some control over the size of the samples
drawn. Because the approach is based on probabilities, there will
be slight variation in the size of samples drawn.
DRAWING MULTIPLE SAMPLES
For most purposes it is necessary to draw more than a single
random sample and in some cases (e.g., bootstrap studies) it may
be necessary to draw hundreds or thousands of samples.
REFERENCES:
Using this approach, multiple samples can be drawn simultaneously by converting the variable samp_weight to an array and
using a DO loop:
Lewis, K (2002). Sampling with and without replacement using
direct access (point=option). Paper presented at the BASUG
Quarterly Meeting, June 25, 2002. Paper available at
www.basug.org.
array samp_weight(100};
do I = 1 to 100;
samp_weight{i} = input(put(uniform(0),wgthx.),2.0);
end;
ACKNOWLEDGMENTS
SAS is a Registered Trademark of the SAS Institute, Inc. of Cary,
North Carolina.
Dependent upon the purposes of the program, additional computations can be performed within the DO loop or the individual
records can be output for use in later analyses.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Dependent upon the purposes of the program, additional computations can be performed within the DO loop or the individual
records can be output for use in later analyses.
Charles A. DePascale
National Center for the Improvement of Educational Assessment
PO Box 351
Dover NH 03821-0351
Work Phone: 603 516-7900
Fax: 603 516-7910
Email: [email protected]
Web: www.nciea.org
array asamp_weight(100};
array bsamp_weight{100};
do I=1 to 100;
asamp_weight{i} = input(put(uniform(0),wgthx.),2.0);
bsamp_weight{i}= input(put(uniform(0),wghtx.),2.0);
end;