NESUG 16 Coders' Corner

CC001
Drawing Multiple Random Samples with Replacement
Charles A. DePascale, Center for Assessment

ABSTRACT
Using SAS, there are several efficient methods to draw random samples with and without replacement. This paper presents a method for drawing multiple random samples with replacement from a data file in a single pass through the data. The approach is based on an application of the binomial distribution when sampling with replacement and uses the PROBBNML function to estimate the probability that a record will be selected a certain number of times.

INTRODUCTION
The purpose of this paper is to provide an alternative technique for drawing random samples with replacement from a single data set or population. Applying the basic technique to draw a single sample requires only the use of Base SAS and an understanding of PROC FORMAT, the use of the PUT and INPUT functions, and a random number generator. Extending the technique to draw multiple samples requires only the use of an array and DO loop.

SAMPLING WITH REPLACEMENT
There are two commonly used methods to draw random samples with replacement, which can be referred to as the ‘Pointer’ and ‘Merge’ methods. The Pointer method involves the use of pointers to a primary data set based on the generation of random numbers. This technique is described in a paper presented by Kimball Lewis at a quarterly BASUG meeting (Lewis, 2002). The Merge method involves merging a data set containing randomly generated record numbers with a primary data set. Both methods are discussed and described in detail on SAS-L.

Conducting simulations using the bootstrap method, I have used both the Pointer and Merge methods to draw thousands of samples ranging in size from 50,000 to 80,000 records. As part of a set of checks on the ‘randomness’ of the random number generator used to draw the samples, I included a frequency distribution of the number of times individual cases were drawn from the original sample.

Reviewing the frequency distribution, it quickly became obvious that the number of times that individual records were selected from the original sample followed a set pattern. Upon closer inspection (combined with the review of some basic statistics texts), it was clear that the number of times that individual records were drawn from the original data set was distributed binomially.

APPLYING THE BINOMIAL DISTRIBUTION
The binomial distribution can be used to predict the probability of occurrence of a particular event when the following assumptions are met:

1. there are two possible outcomes to each event
2. the outcomes are mutually exclusive
3. the events are independent
4. the probability of the occurrence is constant across events.

When those assumptions are met, it is possible to predict the probability of particular outcomes occurring for a given set of events.

In the case of drawing random samples with replacement, all of the assumptions are met. As each record is drawn for the sample, a record is either selected or not selected each time, and the probability that a particular record will be selected is always 1 divided by the number of records in the original sample.

Therefore, based on the binomial distribution, it is possible to determine, when drawing a sample with replacement from a data set of 50,000 records, the probability that a record will be drawn 0 times, 1 time, 2 times, etc. Using those probabilities, it is possible to draw a random sample with replacement by applying the selection probabilities to each record in the original data set.

DETERMINING CASE WEIGHTS
The first step in the process is to determine the selection probabilities. The PROBBNML function computes the probability that an observation from a binomial (n,p) distribution will be less than or equal to m. The syntax of the function is PROBBNML(p,n,m) where, in this case,

n is the number of records in the original group and sample,
p, the probability of selection for each record, is 1/n, and
m is the number of times that a particular record will be selected in a sample drawn with replacement.

Using the function, it is possible to determine the probability that a record will be selected 0, 1, 2, 3, 4, ..., x times.

In practice, the probabilities change very little as sample sizes increase from 10 to 100 to 1000 to 100,000. For samples of 1000 or more records, the probabilities are identical to three decimal places. Depending on the level of precision required, probabilities could be determined for a particular sample size or an approximation could be applied. The probabilities presented in the following section are based on a sample size of 50,000 records.

ASSIGNING CASE WEIGHTS
The output from the PROBBNML function is used to assign case weights to each record in the original data file. Because the function returns cumulative probabilities, its values for m = 0, 1, 2, ... serve directly as the range boundaries in a format. The assignment is accomplished using PROC FORMAT in conjunction with a random number and an assignment statement.

   proc format;
      value wghtx
         0         - .36787576 = 0
         .36787576 - .73575888 = 1
         .73575888 - .91970044 = 2
         .91970044 - .98101307 = 3
         .98101307 - .99634061 = 4
         .99634061 - .99940594 = 5
         .99940594 - .99991678 = 6
         .99991678 - .99998976 = 7
         .99998976 - .99999888 = 8
         .99999888 - .99999989 = 9
         .99999989 - 1         = 10;

As each record is read, a sample case weight is assigned using the following statement:

   samp_weight=input(put(uniform(0),wghtx.),2.);

In the assignment statement, uniform(0) generates a random number between 0 and 1 from a uniform distribution, the put function converts the random number to a value from 0 to 10 using the ‘wghtx’ format, and the input function converts the character value created by the put function to a numeric variable.
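The range boundaries used in the wghtx format can be reproduced with the PROBBNML function described under DETERMINING CASE WEIGHTS. The following is a minimal sketch, not the author's program: the data set name sel_probs and the variables cum_prob, prob, and prev are assumptions introduced here for illustration, and n is set to the 50,000 records used in the paper's example.

```sas
/* Sketch: cumulative and individual selection probabilities for n = 50,000. */
/* sel_probs, cum_prob, prob and prev are illustrative names only.           */
data sel_probs;
   n = 50000;                /* records in the original data set             */
   p = 1 / n;                /* chance a given record is picked on one draw  */
   prev = 0;
   do m = 0 to 10;
      cum_prob = probbnml(p, n, m);   /* P(record is selected m or fewer times) */
      prob = cum_prob - prev;         /* P(record is selected exactly m times)  */
      prev = cum_prob;
      output;
   end;
   format cum_prob prob 10.8;
   drop prev;
run;

proc print data=sel_probs noobs;
   var m prob cum_prob;
run;
```

The cum_prob column should correspond to the upper boundary of each range in the format, and changing n shows how little the values move for larger data sets.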
The case weight can then be used to compute statistics or, if necessary, to control the output records for the random sample. Running PROC FREQ on the variable samp_weight will confirm that the distribution of records selected is as expected.

DRAWING MULTIPLE SAMPLES
For most purposes it is necessary to draw more than a single random sample, and in some cases (e.g., bootstrap studies) it may be necessary to draw hundreds or thousands of samples. Using this approach, multiple samples can be drawn simultaneously by converting the variable samp_weight to an array and using a DO loop:

   array samp_weight{100};
   do i = 1 to 100;
      samp_weight{i} = input(put(uniform(0),wghtx.),2.0);
   end;

Depending on the purposes of the program, additional computations can be performed within the DO loop or the individual records can be output for use in later analyses.

   array asamp_weight{100};
   array bsamp_weight{100};
   do i = 1 to 100;
      asamp_weight{i} = input(put(uniform(0),wghtx.),2.0);
      bsamp_weight{i} = input(put(uniform(0),wghtx.),2.0);
   end;

CONCLUSIONS
The technique described in this paper for drawing random samples with replacement provides an alternative approach that is adaptable to many situations and requires relatively little programming. Without the use of macros, it allows users to draw multiple samples in a single pass through the data.

One potential disadvantage of the technique, as written, is that the user gives up some control over the size of the samples drawn. Because the approach is based on probabilities, there will be slight variation in the size of the samples drawn.

REFERENCES
Lewis, K. (2002). Sampling with and without replacement using direct access (point=option). Paper presented at the BASUG Quarterly Meeting, June 25, 2002. Paper available at www.basug.org.

ACKNOWLEDGMENTS
SAS is a Registered Trademark of the SAS Institute, Inc. of Cary, North Carolina.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:

Charles A. DePascale
National Center for the Improvement of Educational Assessment
PO Box 351
Dover NH 03821-0351
Work Phone: 603 516-7900
Fax: 603 516-7910
Email: [email protected]
Web: www.nciea.org
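As a supplementary illustration of the multiple-sample technique, the sketch below runs the array and DO loop end to end on a small made-up data set. It rests on several assumptions that are not from the paper: the data set names original and boot, the variables id and score, and the use of the FREQ statement in PROC MEANS to treat each case weight as the number of copies of a record. The format boundaries are the paper's values for 50,000 records, applied here to 1,000 records as an approximation.

```sas
/* Format from the paper: maps a uniform random number to a selection count */
proc format;
   value wghtx
      0         - .36787576 = '0'
      .36787576 - .73575888 = '1'
      .73575888 - .91970044 = '2'
      .91970044 - .98101307 = '3'
      .98101307 - .99634061 = '4'
      .99634061 - .99940594 = '5'
      .99940594 - .99991678 = '6'
      .99991678 - .99998976 = '7'
      .99998976 - .99999888 = '8'
      .99999888 - .99999989 = '9'
      .99999989 - 1         = '10';
run;

/* Hypothetical input: 1,000 records with one numeric variable, score */
data original;
   do id = 1 to 1000;
      score = 50 + 10 * rannor(12345);
      output;
   end;
run;

/* Single pass: attach 100 replicate case weights to every record */
data boot;
   set original;
   array samp_weight{100};
   do i = 1 to 100;
      samp_weight{i} = input(put(uniform(0), wghtx.), 2.);
   end;
   drop i;
run;

/* Statistics for the first replicate; records with a weight of 0 are skipped */
proc means data=boot n mean;
   var score;
   freq samp_weight1;
run;
```

The N reported by PROC MEANS is the size of that replicate, which will vary slightly around 1,000 from one replicate to the next, as noted in the conclusions.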