Two Methods to Merge Data onto Every Observation in Another Dataset
Lisa Mendez, Ph.D., Knowesis, Inc., San Antonio, TX
Kim Brunnert, Ph.D., Elsevier, Houston, TX

ABSTRACT
There are times when you just can't seem to find a PROC that will do exactly what you want. We came across a scenario where we needed to calculate the mean of a student data file and then flag student observations that were more than three standard deviations from the mean. We found two methods to do what we needed to do. One method uses a combination of Data steps and Procs, and utilizes the If _N_=1 then set method. The other method utilizes Proc SQL. This paper outlines both methods step by step and illustrates two different ways to do the same thing. Personal preference dictates which method to use.

INTRODUCTION
We have a student data set that identifies students by ID number across two years. Each student has two time values (Time1 & Time2). We want to calculate the mean of all of the students' time values and flag those students whose time values fall more than three standard deviations above or below the mean.

RAW DATA
The raw data set has five variables and 25 observations. The variables are: Student, Semester, Year, Time1, and Time2. Each time value represents time in seconds.

Figure 1. Raw Data.

USING THE IF _N_=1 SET METHOD
Essentially, the If _N_=1 Set method attaches the values of a one-record dataset to every observation of another dataset.

Figure 2. Illustration of how one observation is set to every observation in another dataset.

To create the one-record dataset that we will use, we need to start by calculating the means and standard deviations of the student dataset. We want to calculate them for both the Time1 and Time2 variables. We will use the OUTPUT OUT= statement to create an output dataset.
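As a rough illustration of this first step, the sketch below assumes PROC MEANS is the procedure used (the text names only the OUTPUT OUT= statement). The dataset names work.student_demo and work.student_stats, and every data value, are hypothetical stand-ins and not the authors' data.

```sas
/* Hypothetical stand-in for the student data; all values are made up. */
data work.student_demo;
    input Student Semester Year Time1 Time2;
    datalines;
101 1 2010 52 61
102 1 2010 48 75
103 2 2011 95 58
104 2 2011 50 66
;
run;

/* With no statistic keywords on the OUTPUT statement, PROC MEANS writes
   its default statistics (N, MIN, MAX, MEAN, STD), one row per statistic.
   Each row is labeled by _STAT_; _TYPE_ and _FREQ_ are added as well. */
proc means data=work.student_demo noprint;
    var Time1 Time2;
    output out=work.student_stats;
run;

/* Inspect the output dataset. */
proc print data=work.student_stats;
run;
```

An output dataset of this shape is what the following steps (dropping _TYPE_ and _FREQ_, then transposing) would operate on.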
Sample SAS Code and Output Dataset

Next we will drop the non-essential variables. We only need the _STAT_ variable and the Time1 and Time2 statistics, so we drop the _TYPE_ and _FREQ_ variables.

Sample SAS Code and Output Dataset

Once we have the means and standard deviations for both Time1 & Time2, we need to transpose the data into a format that we can use to calculate three standard deviations above and below the means.

Sample SAS Code and Output Dataset

At this point we can calculate three (3) standard deviations above and below the means.

Sample SAS Code and Output Dataset

Now we have the four values we need: sd3minus and sd3plus for Time1 and Time2. To put those values in the original dataset, we first split the new data into two datasets: one with 3SD ABOVE the mean for Time1 and Time2, and one with 3SD BELOW the mean for Time1 and Time2.

To create the 3SD ABOVE dataset, we will create a dataset with only the Time variables and the 3SD ABOVE values, transpose the data, and then rename variables and drop non-essential variables.

Sample SAS Code and Output Dataset

We will follow the same steps to create the 3SD BELOW dataset.

Sample SAS Code and Output Dataset

Now we are ready to merge the data together by using the If _N_=1 Set syntax. Reading the syntax below, we are stating: on the first iteration of the DATA step (when _N_ equals 1), read the single observation from the first dataset (Student_sdplus_t2); its values are then retained and combined with every observation read from the Student_data dataset located in the SASPaper library, and the result is written to a dataset named Student_SD3_plus_minus1.

Sample SAS Code and Output Dataset

Notice that every observation in the original Student_Data dataset has the same 3SD Above values for Time1 and Time2 merged with it.

We will now merge the 3SD Below values.
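The If _N_=1 Set merge described above can be sketched as follows. This is a minimal, self-contained illustration and not the authors' code: the dataset names work.limits_demo, work.students_demo and work.students_with_limits, the variable names Time1_sd3plus and Time2_sd3plus, and all values are hypothetical. In the paper the corresponding datasets are Student_sdplus_t2, SASPaper.Student_Data and Student_SD3_plus_minus1.

```sas
/* Hypothetical one-record dataset holding the 3SD-above limits. */
data work.limits_demo;
    Time1_sd3plus = 100;
    Time2_sd3plus = 120;
run;

/* Hypothetical stand-in for the student data. */
data work.students_demo;
    input Student Time1 Time2;
    datalines;
101 50 60
102 70 130
103 110 80
;
run;

data work.students_with_limits;
    /* Executed only on the first iteration: reads the single record.
       Variables read with SET are retained automatically, so the limit
       values stay in place for every later iteration. */
    if _N_ = 1 then set work.limits_demo;
    /* Executed on every iteration: reads the next student observation. */
    set work.students_demo;
run;

proc print data=work.students_with_limits;
run;
```

Every row of the result should carry the same two limit values next to its own Time1 and Time2. Repeating the step with a one-record dataset of 3SD-below limits would add those values in the same way.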
The final file has both the 3SD values Above and the 3SD values Below for the Time1 and Time2 variables. Now we can flag the observations that have either a Time1 or Time2 value more than three standard deviations above or below the mean. We can create a dataset with all of the observations and flags, but we can also create a dataset with only flagged observations and another dataset with un-flagged observations.

Sample SAS Code and Output Dataset

All Student_Data observations with Flag variable

Only Student_Data observations with Flag variable equal to 1 (Flag variable has been dropped)

Only Student_Data observations with Flag variable NOT EQUAL to 1 (Flag variable has been dropped)

USING PROC SQL
Using PROC SQL to achieve the same results can eliminate multiple steps; however, many people are reluctant to use PROC SQL if they are unfamiliar with it.

As with the If _N_=1 Set method, we must first begin by calculating the means and standard deviations for the Time1 and Time2 variables. In the SAS code example below, note that the format statement is not necessary, but it helps to make the values more readable.

Sample SAS Code and Output Dataset

Next we will use a data step to compute three (3) standard deviations above and below the means, flag the observations, and write out the three different datasets. We will also eliminate non-essential variables for each output dataset specified.

Sample SAS Code and Output Dataset

All Student_Data observations with Flag variable

Only Student_Data observations with Flag variable equal to 1 (Flag variable has been dropped)

Only Student_Data observations with Flag variable NOT EQUAL to 1 (Flag variable has been dropped)

Each method yields exactly the same data sets and results. It is up to the user to determine which method he or she wants to use.
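For readers who want to see the general shape of the PROC SQL approach described in this section, here is a minimal sketch under stated assumptions. It relies on PROC SQL remerging summary statistics back onto the detail rows when a query selects ordinary columns alongside summary functions without a GROUP BY clause. All dataset names, variable names other than Time1 and Time2, and data values are hypothetical, and the authors' actual code may differ.

```sas
/* Hypothetical stand-in for the student data; values are made up. */
data work.students_demo;
    input Student Time1 Time2;
    datalines;
101 52 61
102 48 75
103 95 58
104 50 66
;
run;

/* Summary functions selected next to detail columns, with no GROUP BY:
   PROC SQL remerges the statistics onto every row. The FORMAT= modifiers
   are optional and only affect how the values are displayed. */
proc sql;
    create table work.students_stats as
    select *,
           mean(Time1) as Time1_mean format=8.2,
           std(Time1)  as Time1_std  format=8.2,
           mean(Time2) as Time2_mean format=8.2,
           std(Time2)  as Time2_std  format=8.2
    from work.students_demo;
quit;

/* One data step computes the limits, flags rows, and writes three datasets. */
data work.all_flags
     work.flagged   (drop=Flag)
     work.unflagged (drop=Flag);
    set work.students_stats;
    Time1_sd3minus = Time1_mean - 3 * Time1_std;
    Time1_sd3plus  = Time1_mean + 3 * Time1_std;
    Time2_sd3minus = Time2_mean - 3 * Time2_std;
    Time2_sd3plus  = Time2_mean + 3 * Time2_std;
    /* A logical expression evaluates to 1 (true) or 0 (false). */
    Flag = (Time1 < Time1_sd3minus or Time1 > Time1_sd3plus or
            Time2 < Time2_sd3minus or Time2 > Time2_sd3plus);
    drop Time1_mean Time1_std Time2_mean Time2_std;
    output work.all_flags;
    if Flag = 1 then output work.flagged;
    else output work.unflagged;
run;
```

With only four made-up rows, no value can lie three standard deviations from the mean, so work.flagged comes out empty here; the point of the sketch is the structure. Note also that a missing time value would compare as lower than any limit and would be flagged, so real data may need an explicit check for missing values.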
Neither method is better or worse than the other. Some people may be more comfortable using data steps and PROCs, while others are more comfortable and familiar with PROC SQL.

CONCLUSION
It is always difficult to figure out ways to do things that are not common. We were able to figure out how to utilize the IF _N_=1 Set method by using the data step and PROCs that we were familiar with. We decided to tackle the same issue using PROC SQL and to illustrate both ways in order to accommodate many SAS users. Each method will yield the same results. As long as the results are accurate, either method will work.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:

Kim Brunnert
Elsevier
11011 Richmond Ave, Ste 450
Houston, TX 77042
Phone: 713-346-6984
E-mail: k.brunne [email protected]

Lisa Mendez
Knowesis, Inc.
San Antonio, TX
Phone: 202-709-8932 ext 231
E-mail: lmendez@knowesis-inc.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.