PRODUCING SAS FILES FROM LARGE MASTER FILES FOR A CLINICAL RESEARCH PROJECT

Frank E. Harrell, Jr., Lipids Research Clinics Program, University of North Carolina

The Primary Prevention Trial of the Lipids Research Clinics Program is a clinical intervention trial of almost 4000 subjects, gathering data for thousands of variables on each subject. The subjects each attend 60 visits to their clinic. Visits 1-5 are baseline visits and data forms for these visits are dissimilar. After visit 5, clinic visits consist of only three types: 2 month, 6 month, and yearly visits. So for visits 6-60, the visit number and one of three system item numbers completely define each variable for purposes of retrieval of data for statistical analyses. For visits 1-5, five system item numbers describe the address of each variable. Although data forms vary across clinic visits, there is much overlapping of variables between visits in terms of definitions, units of measurement, and input formats. Therefore, we can reduce the problem of data retrieval to its simplest terms by setting up a dictionary containing item numbers and descriptions of variables.

In our master data files, every item has stored with it a 1-byte status field describing the quality of the item. These are outlined in the following table:

Figure 1 - Meanings of Item Status

  Status   Meaning                                          SAS Special Missing Value
  Blank    Form containing item not arrived to system       .
  1        Form containing item arrived but item not        .A
           obtainable
  2        Item has an edit error (e.g. out of valid        .B
           range)
  4        Item inconsistent with another item              .D
  6 or 9   Item has good status but is blank or not         .E
           valid numeric
  6        Valid numeric field, no errors                   >.Z, i.e. a valid value
  9        Item verified by clinic after initially          >.Z, i.e. a valid value
           failing an edit or consistency check and is
           a valid numeric field

Because calculations are to be performed only on non-missing, non-erroneous data, we have found that it is not worthwhile to carry statuses to the file used in analyses, but only to recode variables which have bad status (status < 5). For numeric variables, the special missing values in SAS allow for recoding of bad items such that the reason an item was not clean will be known. The special missing values used for such recoding are displayed in Figure 1. For example, a 3-byte numeric value having a status of 2 is recoded with 'B' by the analysis file retrieval program so that SAS will input the variable as .B if a MISSING statement is given.

The missing value .E is used principally for multiple-choice items in which a blank answer means "negative" and a non-blank answer means "affirmative". Recoding the blank with E allows distinguishing between "negative" and "form not arrived yet".

For character variables, of which there are very few on our forms, values are merely changed to blank if the corresponding statuses are less than 5. This could be modified to recode the values to special characters instead of blanks.

Special programs are needed to retrieve data for analysts from the master file because the data are encrypted and because only a subset of the thousands of variables for each subject are desired for analysis.

Analysis file subsetting is initiated by listing the mnemonics and visit numbers of desired items. These are linked with an item dictionary. There are two separate item dictionaries: one for visits 1-5 and one for visits 6-60. The item dictionary defines the system item numbers, decimal places, label, and answer type for categorical items, given the mnemonic base. The final SAS variable name is formed by concatenating the visit number (without leading blanks or zeros) to the mnemonic base. If the item number field on the dictionary is blank for a particular visit, the variable is undefined for that visit (i.e. is not on any form for that visit) and hence is not retrieved. The dictionary for visits 1-5 is described in Figure 2:

Figure 2 - Layout of Item Dictionary

  Columns  Contents
  1-5      Mnemonic for item, to become SAS name after adding visit number
  6-30     Five system item numbers for visits 1-5 which define the location
           of the item on the master file
  31       Number of places to right of decimal point; blank if integer or
           character
  32-71    Description of item, to become SAS label
  80       Answer type for item, if categorical; defines "value labels"

Example of portion of dictionary (the column alignment of the item numbers is not preserved here):

  CHOL   18418 23566  1418  1193  1803   CHOLESTEROL
  LDL    18606 23725  1606  1371  2077   ESTIMATED LDL
  ...
  VLDL   23871                           VLDL CHOLESTEROL
  [one row illegible]
  WT     [item numbers illegible]        WEIGHT, KG
  MBRTH  281                             MONTH OF BIRTH
  DBRTH  992                             DAY OF BIRTH
  YBRTH  998                             YEAR OF BIRTH
  EVMAR  320                             EVER MARRIED?             M
  RACE   21288                                                     R
  EDUC   21235                           EDUCATION                 E
  OCCUP  21303                           OCCUPATION                O
  CORNA  22360                           CORNEAL ARCUS?            A

The retrieval program outputs a file suitable for reading in a SAS program, automatically computing locations in the output record for each item. Use is made of a separate system table, not described here, which contains the type (numeric or character) and length of each item, given the item number. The following statements give a simple example of creating a SAS file from one of our master files:

Figure 3 - Example of Creating SAS File

  // EXEC FETCH    - subset master file. Read dictionary to get item numbers
                     for mnemonics and visit numbers listed below.
  //INPUT DD       - master file
  //OUTPUT DD DSN=SUBSET,...
  //ITEMS DD DSN=ITEMS,DISP=(NEW,PASS),...
                   - will have one record for every item at each visit
                     containing label, decimal places, etc., and position
                     of item in extracted file
  AGE 1 3
  WT 1 2 3
  HT 1
  RACE 1
  SEX 1
  SGOT * 1 4       (visits 1-4 inclusive)
  // EXEC SASDESF  - produces SAS macros for reading and describing file
  //ITEMS DD DSN=ITEMS,...
  //OUTPUT DD DSN=MACROS,...
  // EXEC SAS
  //SAVE DD        - disk dataset for saving SAS file
  //IN DD DSN=SUBSET,...
  //SYSIN DD DSN=MACROS,...
  // DD *
  DATA SAVE.ONE;
  INFILE IN;
  MISSING A B C D E;
  LABEL ID=PATIENT ID LABELF;
  LENGTH LENGTHF;
  FORMAT FORMATF;
  INPUT ID $ 1-9 INPUTF;
  ... statements to compute derived values ...
  ... LABEL, LENGTH, FORMAT for derived values ...

We also have occasion to run a program before FETCH which splits lists of mnemonics by the mnemonic base type so that multiple SAS datasets are created. This is useful when there are more than 1024 variables to analyze (the maximum SAS allows): the variables can be split up into 2 or more logical groups, and 2 or more SAS datasets can be created by executing SASDESF and SAS more than once.

Answer types for categorical variables are defined by linking the ITEMS file (Figure 3) with an external "ANSWER" file. Since SAS does not yet have value labels, program SASDESF merely prints descriptions of answers as comments in the user's SAS program for documentation. The program builds macros for specifying LABEL, LENGTH, INPUT, and FORMAT statements. These are exemplified in Figure 4 below:

Figure 4 - Comments and Macros Produced by Program "SASDESF" in Example in Figure 3

  *-----------------------------------------------*
   VARIABLE  ANSWERS
   RACE      1=WHITE 2=BLACK 3=ORIENTAL 9=OTHER
   SEX       1=MALE 2=FEMALE
  *-----------------------------------------------;
  PAGE;
  MACRO INPUTF @10 AGE1 2. @12 AGE3 2. @14 WT1 4.1 @18 WT2 4.1
    @22 WT3 4.1 @26 HT1 4.1 @30 RACE1 1. @31 SEX1 1.
    @32 SGOT1 6.2 @38 SGOT2 6.2 ... SGOT4 6.2%
  MACRO FORMATF AGE1 2. AGE3 2. WT1 5.1 WT2 5.1 WT3 5.1 HT1 5.1
    RACE1 1. SEX1 1. SGOT1 7.2 ... SGOT4 7.2%
  MACRO LENGTHF AGE1 2 AGE3 2 WT1 3 WT2 3 WT3 3 HT1 3
    RACE1 2 SEX1 2 SGOT1 4 ... SGOT4 4%
  MACRO LABELF AGE1='AGE IN YEARS' AGE3='AGE IN YEARS'
    WT1='WEIGHT, KG' WT2='WEIGHT, KG' WT3='WEIGHT, KG'
    HT1='HEIGHT, CM'%
  (Others had blank label fields in dictionary since their
   mnemonics were self-explanatory).
  PAGE;

The labels were retrieved from the item dictionary and put in macro LABELF.
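The way the INPUTF, FORMATF, and LENGTHF entries of Figure 4 follow from a list of requested items can be illustrated with a short sketch. This is not the SASDESF program itself, whose source is not given in the paper; it is a minimal Python illustration. The names `Item`, `build_macros`, and `min_length` are hypothetical, the starting column of 10 is taken from the ID field occupying columns 1-9, and the byte-capacity rule in `min_length` (one byte for sign and exponent, 8 bits of precision per further byte) is an assumption chosen because it agrees with the lengths shown in Figure 4.

```python
from dataclasses import dataclass


@dataclass
class Item:
    base: str      # mnemonic base from the item dictionary, e.g. "WT"
    visit: int     # visit number
    width: int     # columns the field occupies in the extracted record
    decimals: int  # places to the right of the decimal point (0 if integer)


def sas_name(item):
    # The visit number is appended without leading blanks or zeros.
    return f"{item.base}{item.visit}"


def informat(item):
    # Input specification: "m." for integers, "m.d" otherwise.
    return f"{item.width}.{item.decimals}" if item.decimals else f"{item.width}."


def out_format(item):
    # Integers reuse the input specification; m.d becomes k.d with k = m + 1.
    if item.decimals:
        return f"{item.width + 1}.{item.decimals}"
    return f"{item.width}."


def min_length(n_columns):
    # ASSUMPTION: a stored length of L bytes represents integers up to
    # 256 ** (L - 1) exactly. Return the smallest L (at least 2) that
    # holds n_columns significant digits, falling back to 8 bytes.
    for nbytes in range(2, 8):
        if 10 ** n_columns - 1 <= 256 ** (nbytes - 1):
            return nbytes
    return 8


def build_macros(items, first_col=10):
    # Walk the items in order, assigning consecutive column positions.
    col = first_col
    inputs, formats, lengths = [], [], []
    for item in items:
        name = sas_name(item)
        inputs.append(f"@{col} {name} {informat(item)}")
        formats.append(f"{name} {out_format(item)}")
        lengths.append(f"{name} {min_length(item.width)}")
        col += item.width
    return {
        "INPUTF": "MACRO INPUTF " + " ".join(inputs) + "%",
        "FORMATF": "MACRO FORMATF " + " ".join(formats) + "%",
        "LENGTHF": "MACRO LENGTHF " + " ".join(lengths) + "%",
    }


# Items requested in Figure 3; widths and decimals are read off Figure 4.
EXAMPLE_ITEMS = (
    [Item("AGE", v, 2, 0) for v in (1, 3)]
    + [Item("WT", v, 4, 1) for v in (1, 2, 3)]
    + [Item("HT", 1, 4, 1), Item("RACE", 1, 1, 0), Item("SEX", 1, 1, 0)]
    + [Item("SGOT", v, 6, 2) for v in (1, 2, 3, 4)]
)
macros = build_macros(EXAMPLE_ITEMS)
for text in macros.values():
    print(text)
```

Under these assumptions the sketch reproduces the column positions, formats, and lengths visible in Figure 4, and a 9-digit field such as a social security number comes out at 5 bytes.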
The lengths generated for the LENGTH statement are the minimum number of bytes allowing for n significant digits, where n is the number of columns for the numeric field on the input record. Out of all the variables in our data, the only one requiring more than 4 bytes of floating binary storage is social security number, so considerable savings of disk space can be achieved by computing the minimum length for each variable and not using the default length of 8 bytes, which allows for 16 significant digits.

The components of the INPUT statement are gotten from the ITEMS file. These include column locations on the subsetted analysis file, types (character or numeric), and number of places after the decimal point for numeric items.

The output formats for numeric variables for the FORMAT statement are taken as the input specification for integer data (e.g. FORMAT X 2.) or k.d when the input specification was m.d, where k=m+1. The main reason for using output formats is that for non-integer data such as 33.01, SAS may print the value as 33.0099770, for example, when the length of the stored variable is less than 8 bytes and the implied format of the variable has been overridden by placing the variable on the left hand side of an equals sign. This problem is due to the truncation problem described in the SAS manual. Giving the variable an output format such as FORMAT Y 5.2 causes the variable to always be printed out by SAS programs or procedures in a nice format.

After statements for reading and describing a SAS file, statements computing derived values can appear. Due to the propagation of missing values in any algebraic statement, new variables thus created are automatically recoded if any of the component variables have bad status (special missing value).

After this file is built, we have to take care of some problems which are present with the type of data we have.
For instance, subjects can transfer between clinics, so that the master files may have partial records for the subject from each clinic. Each record will have the same subject ID code. We wish to pool all available data for each unique subject ID into a single record. This is easily handled using the UPDATE statement and a small auxiliary file containing a list of all unique IDs. After this problem is handled, other operations can take place, such as merging baseline data with data for later visits. The visit number at the end of the variable names makes them unique across visits.

We run a procedure written by us, DATACHK, on the completed file. This procedure prints, for every numeric variable in the file, the number of special missing values broken down by .A-.E, and the lowest and highest 5 distinct values of the variable. This is useful in checking the quality of the dataset. If errors are found, variables are corrected by running a small SAS program. We use the value .C to denote bad data found at analysis time. When all this is done, running PROC CONTENTS gives the user complete documentation of the file, a step which used to be very time consuming.

When reading large SAS datasets, we have found the following practices to save money in terms of I/O time.

1. When subsetting files, using OPTIONS GEN=0 can decrease the number of I/O operations, since the large amount of source code, variable descriptions, and history information is not copied each time.

2. Using KEEP= in the SET statement, e.g. DATA NEW; SET SAVE.OLD(KEEP=X Y Z ...); can increase efficiency when subsetting a small proportion of variables from a file with many variables. This is more efficient than using a separate KEEP statement after the DATA statement.

This work was supported by U.S. National Heart, Lung, and Blood Institute contract number NHLI-NIH-71-2243 from National Institutes of Health.
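The pooling of partial records for subjects who transferred between clinics, which the paper handles with the SAS UPDATE statement, can be illustrated outside SAS. The following is a minimal Python sketch rather than the authors' program. The function name `pool_records` and the sample variable names are hypothetical, every kind of missing value is represented simply as `None`, and it is assumed that a missing value never overwrites a value already pooled.

```python
def pool_records(unique_ids, partial_records, id_key="ID"):
    """Pool partial records into one record per subject ID.

    unique_ids plays the role of the small auxiliary file of all unique
    IDs; partial_records are the per-clinic records, each a dict that
    includes the subject ID. Missing values are represented as None
    (an assumption of this sketch).
    """
    pooled = {sid: {id_key: sid} for sid in unique_ids}
    for record in partial_records:
        target = pooled.get(record[id_key])
        if target is None:
            # ID absent from the auxiliary list: skipped in this sketch.
            continue
        for name, value in record.items():
            # A non-missing value replaces whatever is pooled so far;
            # a missing value only fills a variable not yet seen.
            if value is not None or name not in target:
                target[name] = value
    return [pooled[sid] for sid in unique_ids]


# Subject 001 has partial records from two clinics.
ids = ["001", "002"]
records = [
    {"ID": "001", "CHOL1": 250, "CHOL6": None},
    {"ID": "002", "CHOL1": None},
    {"ID": "001", "CHOL1": None, "CHOL6": 231},
]
pooled = pool_records(ids, records)
for row in pooled:
    print(row)
```

Seeding the result from the list of unique IDs guarantees exactly one output record per subject, in a known order, whatever the number of partial records found for each.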