Using SAS® Data Sets to Mimic a Relational Database

Greg M. Woolridge, Abbott Laboratories, Abbott Park, IL
Paul R. Coe, PhD, Rosary College, River Forest, IL

Abstract

Although not originally intended to be used as a database management system (DBMS), it is possible to use SAS data sets to mimic a relational database. The advanced data handling capabilities of SAS permit users to perform most tasks usually associated with a DBMS. By storing data in several SAS data sets which are linked by common variables, end users can manage, update and analyze the data in much the same way they would if the data were stored in a database.

We have implemented this idea to store, manage and analyze survey data. SAS was used for all phases of the project, starting with reading data from flat files and progressing through data entry and correction. This paper will discuss the techniques and present some of the code we used to get the data into SAS data sets and manage it.

Introduction

A database is generally defined as a collection of data which is usually stored on a computer. A relational database contains data which are related to one another in some way. These relationships are defined when the database is designed and form the framework for storing the data. By using common variables in several data sets which act as keys and define the relationships, SAS can be used as a DBMS to design and maintain a database.

Definitions

SAS data sets are commonly used to store data. Two data sets can be linked to one another by including variables which are common to each data set. These variables, known as key variables, serve two purposes: (1) they function to make each observation in a data set uniquely defined and (2) they link data sets together so information can be easily retrieved.

Observations are uniquely defined if each one has a combination of values in the key variables which is different from every other observation. For example, if the key variables are defined to be VARA, VARB and VARC, data set A might have the following values:

Data Set A

    OBS   VARA   VARB   VARC
     1     A      1     HOSP
     2     A      1     ER
     3     A      2     HOSP
     4     B      1     HOSP

Since every observation has a combination of values which is different from all others, each observation can be defined as unique on these variables.

By including the same variables in two or more data sets, these data sets can be linked. This allows for retrieval of information in a data set based on values in a second data set. As an example of this, consider data set B:

Data Set B

    OBS   VARA   VARB   VARD
     1     A      1     BP
     2     A      2     BP
     3     B      1     UA

This data set also contains unique observations, but this time only VARA and VARB are used as keys. Since these variables are also present in data set A, the two can be merged together to produce data set C.

Data Set C

    OBS   VARA   VARB   VARC   VARD
     1     A      1     HOSP   BP
     2     A      1     ER     BP
     3     A      2     HOSP   BP
     4     B      1     HOSP   UA

Database Design

We have used this concept of having key variables link data sets at Abbott Laboratories to enter, store and analyze data from a study designed to determine the economic benefit of a drug treatment for a specific disease.

In our study, data was collected through use of a survey of patients in a clinical trial. Patients were followed for 12 months and contacted once a month to determine what health care services they had used during the previous month. Specifically, each patient was asked how many times they had visited a doctor or emergency room and if they had been admitted to a hospital, as well as questions about disabilities and medication use. The reason for each health care event and any tests or procedures performed were also collected. Responses were entered into a standard questionnaire developed and maintained on a computer system. The data was sent to us once a month in the form of flat files on magnetic tape.

We needed to take the data from the survey and merge it with standard cost data to determine a total health care cost for each patient over a 12 month period. Our usual method of entry and storage of data uses the Nomad DBMS. Once data has been entered into Nomad, it is then transferred to SAS data sets, with each data set created representing a single segment in the Nomad database. However, in our case, we needed to have the data ready for analysis in a shorter time frame than was possible using Nomad. We also discovered some limitations in the way our DBMS was set up which would have made it extremely difficult to build the database we needed. So we decided to enter the data directly into SAS data sets, using a database structure which is similar to that which results from a Nomad to SAS transfer.

A complete schematic of the entire database can be found in Figure 1. The top level of the database structure is the Investigator. Since there may be any number of investigators for a study, the number of observations in this data set may vary between databases. Each observation is unique because each investigator is assigned a unique number, variable INVNO, at our company. The study number variable, STUDYNO, is also included in this data set. Therefore, the keys for the Investigator data set are STUDYNO and INVNO.

Investigator has only 1 child data set associated with it, Patient. This data set contains all the demographic information on each patient in the study. This includes such things as age, sex, and the patient's initials. Also included is a unique number assigned to each patient as an identifier, PTNO. The keys for this data set are STUDYNO, INVNO and PTNO.

Patient is the parent to 3 different data sets: Study Drug Administration, Visit Data and Premature Terminations. Study Drug Administration and Premature Terminations were not needed for the analysis we intended to do, but were included to make the database match our standard databases. This data was actually collected at the clinical study sites and is not important to our discussion.

The Visit Data data set contains information on each call, such as call number, month and day of call, how many ER visits were reported at the call and how many hospitalizations were reported. The keys for this data set are STUDYNO, INVNO, PTNO and the call number, VDCALL. This data set is the parent or grandparent of all remaining data sets. All the remaining data sets contain information collected in the survey and have their own keys associated with them.

The Emergency Room data set contains information about any visits to an emergency room. This includes the date of the visit and the reason for the visit. Data on the procedures performed during the visit also are in this data set. The keys are INVNO, PTNO, VDCALL and VWSEQITM.

Provider Contacts contains all the information on each contact a patient had with a health care provider during the period covered by the call. Providers include MD's, nurses, chiropractors, and even lab technicians. Information in this data set includes such things as the provider's name and specialty, the month and day of the contact and the reason for the contact. The keys in this data set are INVNO, PTNO, VDCALL and TWSEQITM. TWSEQITM is a counter that is incremented by 1 with each new provider contact within a call. This variable was used as a key instead of a provider identifier since a patient may have multiple contacts with the same provider and each contact is recorded separately.

The General Study Procedures data set has 2 parents, Emergency Room and Provider Contacts. This is unusual, but was needed since many of the procedures could apply to either of the parent data sets. The procedures contained in this data set consist of both standard prompts in the survey and free text descriptions. In addition to the parent keys of INVNO, PTNO and VDCALL, there are 2 additional keys in this data set. GPITYP indicates which type of contact, Provider or Emergency Room, the procedures are part of. GPSEQITM is another counter that takes the value of the counter TWSEQITM in Provider Contacts or a similar one, VWSEQITM, in Emergency Room, depending on the value of GPITYP.

The Hospitalization data set contains the admit and discharge dates, the reason for the admission, any procedures performed during the stay and any surgeries. The keys for this data set are INVNO, PTNO, VDCALL and HPSEQITM.

The Other Medications data set contains the name and dosing information for all prescription medications the patient took during the study. The keys for this data set are INVNO, PTNO, VDCALL, OMTYPE and OMSEQITM.

The Home Care segment contains information on assistance a patient received in their home from both paid and unpaid sources. The data contained are type of helper, number of days of care and the reason for the care. The keys for this data set are INVNO, PTNO and VDCALL.

The Lost Work Days, Decreased Daily Activities and Bed Rest segments all contain information on when a patient was unable to participate in normal activities due to a medical condition. Each data set contains the reason and the number of days the patient's activities were affected. The keys for each of these data sets are INVNO, PTNO and VDCALL.

The final data set, Reason Flags, is the link between all other data sets which contain health care utilization, or what we call events. All segments below Visit Data, except Reason Flags, contain events as well as a reason for that event. Reason Flags attempts to link those segments together chronologically rather than hierarchically. This was one of the factors we considered when we decided to build our database in SAS instead of our usual DBMS. The databases we routinely build are structured in a strictly hierarchical manner. Since the Reason Flags segment links together several segments, we were unable to build that segment into a database using our standard DBMS.

For every event mentioned by a patient during a telephone call, a reason for the event was collected and a list of reasons for all events was compiled. Each reason was assigned a number to identify it within that call. Then each time a patient mentioned a new event, a list of the previously mentioned reasons was presented and the patient was asked if the current event was due to one of the previously mentioned reasons. If it was, then the number of the previously mentioned reason was inserted in the data set to tie that event back to an observation in the Reason Flags data set. If the event was due to a new reason, the reason was inserted in the event data set instead of a number. The new reason was also placed in the Reason Flags data set and given the next consecutive number so it could be used for other events. Using this method allows all the events for a single reason to be tied together, even if they are in separate data sets. This provides an excellent example of how SAS can be used to mimic a relational database by using common variables in multiple data sets to link the data sets together. The keys in Reason Flags are INVNO, PTNO, VDCALL and WYRNO, the reason number.

From Flat File to SAS Data Set

The data was received in a series of flat files on tape, and several tapes were received over time. Each tape contained the calls made since the previous tape. Since patients were accrued over time, at any point in time not all patients would have had the same number of calls. Therefore, the calls were grouped into files by the number of the call. For example, when a tape was sent, the fifth call made to any patient in the time frame covered by that tape was put into a DATA5 file on the tape. The same was done for any other call number which might have been used during the time frame. This meant that we might not always receive the same files on each tape. If no baseline calls (call number 0) were made during the time frame, then there would be no DATA0 file on the tape. When the files were copied onto a disk, a separate library was created for each tape received, since we usually had DATA files of the same number on more than one tape.

The first step when we received a tape was to get the data into an intermediate SAS data set. This was done with a DATA step using an INFILE statement. Once the layout of the flat tape file is known, this is a relatively simple step. In our case, there were 3 different layouts possible, depending on the telephone call number. These layouts were sent in separate flat files on the tape each month. We took the first set of layout files we received and modified them to add a variable name for each field. Each time a new tape was received, the new layout files were compared with the modified files to determine if any changes had been made to the layouts. The modified layout files were then used as input into the program creating the intermediate SAS data sets to create the INPUT statements. The code to accomplish this is presented in Figure 2.

Looking at the code in Figure 2, you can see that macro processing is used extensively. Four macro variables are created with %LET statements (not shown here) and are defined as:

    MAXDATA    the largest DATAx file to be read
    MAXTAPE    the number of tapes being used
    MINDATAx   the first DATAx file to be read in a specific tape library
    MAXDATAx   the last DATAx file to be read in a specific tape library

Each tape received has a number of files on it, one file for each set of phone calls made the previous month. The files are designated DATA0 - DATA12, with 0 being a baseline call. The files are copied from the tape into a library on the VM mainframe by using a file type of TAPEx, where x indicates which tape it is (i.e. first tape, x=1; second tape, x=2; etc.).

The code uses a %DO-%TO loop to process each DATAx file individually. The macro first decides which layout file, VARMAP0, VARMAP1 or VARMAP3, is appropriate to use for processing each flat file and sets a macro variable, FORM, that can be used in the rest of the code. The information from the layout file is then read into a SAS data set. A DATA _NULL_ step is used to create a series of macro variables which contain the variable name, type, position and label for the data to be read from each DATAx file.

The next step is to read the DATAx file and create a SAS data set with 1 observation for each patient. This is done for each tape received, and then all the files for each type of call are brought together into a single SAS data set. Since each patient is being called only once per month, the end result should be a series of SAS data sets, one for each call, each containing 1 observation per patient. The final data sets are named SUGI.DATA&M, where &M is the number of the phone call.

The code in Figure 3 is a partial listing of the code that takes the SUGI.DATA&M data sets and makes the individual data sets that make up the database. The macro SETDATA makes it easier to deal with varying numbers of input data sets. Since some of the earlier tapes we received did not have calls all the way through call 12, we did not always have 13 input data sets. Using macro code makes it easier to deal with this situation since you need only make 1 change near the top of the program.

The data step where EMROOM is created sets all of the SUGI.DATA&M data sets together. Since the data sets have multiple Emergency Room visits per observation, these need to be broken out into separate observations in the database. The DO - TO loop does this using array processing to create the variables for the output data set and then using an OUTPUT statement at the bottom of the loop. Once this is done, the data set is sorted by the key variables and a permanent data set, STUDY.VW, is created. In the actual code we used, all of the data sets for the database were created in the same data step using code similar to the code for Emergency Room.

Once all of the individual data sets were created, we started to code the procedures. What we wanted was a standard CPT4 code for each procedure, both prompted and free text, so standard costs could be applied. The prompted procedures were easy to code. In order to code the free text procedures, a list of all procedures was printed and submitted to our Medical department. They assigned 1-6 codes for each procedure and returned the list to us.

At the same time the list was generated for our Medical department, a SAS data set was created which contained all the procedures and variables to hold all the codes. When the list was returned, the codes were entered into the variables using data entry screens developed with SAS/AF. When the data was analyzed, the file with the codes was merged with the database by the procedure so the CPT4 codes were available.

Conclusion

By storing data in several SAS data sets which are linked by common variables, end users can manage, update and analyze the data in much the same way they would if the data were stored in a database. We have created such a database at our company. By merging costs associated with events stored in the database, our end users are able to analyze the economic benefits associated with the drug treatment. SAS was used for all phases of the project, starting with reading data from flat files and progressing through data entry and correction. SAS also allowed us to define relationships in the data which could not be easily defined using our standard DBMS.

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
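To make the key-variable linkage from the Definitions section concrete, the following is a minimal sketch, not the authors' code, of how two data sets that share the keys VARA and VARB could be checked for uniqueness and then match-merged. The data set names (a, b, c), the IN= flags (in_a, in_b) and the toy values are assumptions chosen for illustration, modeled on data sets A and B.

```sas
/* Illustrative data only: modeled on data sets A and B in the text. */
data a;
   input vara $ varb varc $;
   datalines;
A 1 HOSP
A 1 ER
A 2 HOSP
B 1 HOSP
;
run;

data b;
   input vara $ varb vard $;
   datalines;
A 1 BP
A 2 BP
B 1 UA
;
run;

/* A match-merge requires both inputs to be sorted by the key variables. */
proc sort data=a; by vara varb; run;
proc sort data=b; by vara varb; run;

/* Confirm that B is unique on its keys: any BY group with more
   than one observation is written to the log. */
data _null_;
   set b;
   by vara varb;
   if not (first.varb and last.varb) then
      put 'Duplicate key in B: ' vara= varb=;
run;

/* Link the two data sets on the common keys. IN_A and IN_B are
   hypothetical flag names; here only keys present in A are kept. */
data c;
   merge a (in=in_a) b (in=in_b);
   by vara varb;
   if in_a;
run;

proc print data=c;
run;
```

Because B has exactly one observation per key combination, its VARD value is carried onto every observation of A in the same BY group, which is the behavior a lookup against a parent table would give in a DBMS.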
FIGURE 1

SCHEMATIC OF DATABASE

    INVESTIGATOR
        PATIENT
            STUDY DRUG ADMINISTRATION
            VISIT DATA
                HOSPITALIZATION
                EMERGENCY ROOM
                PROVIDER CONTACTS
                    GENERAL STUDY PROCEDURES (child of both EMERGENCY ROOM and PROVIDER CONTACTS)
                HOME CARE
                LOST WORK DAYS
                DECREASED DAILY ACTIVITIES
                BED REST
                OTHER MEDICATIONS
                REASON FLAGS
            PREMATURE TERMINATIONS

FIGURE 2

    %MACRO IN;
    %DO M=0 %TO &MAXDATA;
       %IF &M=0 %THEN %LET FORM=VARMAP0;
       %ELSE %IF &M=1 OR &M=2 %THEN %LET FORM=VARMAP1;
       %ELSE %LET FORM=VARMAP3;
       CMS FI FILFORM DISK &FORM SAS A;
       DATA FILFORM;
          INFILE FILFORM MISSOVER;
          LENGTH POS $9;
          INPUT VAR $ POS $ LEN $ TYPE $ SASVAR $;
       DATA _NULL_;
          SET FILFORM END=EOF;
          IF TYPE='A' THEN CHAR='$';
          CALL SYMPUT('V'||TRIM(LEFT(_N_)),SASVAR);
          CALL SYMPUT('L'||TRIM(LEFT(_N_)),VAR);
          CALL SYMPUT('P'||TRIM(LEFT(_N_)),POS);
          CALL SYMPUT('T'||TRIM(LEFT(_N_)),CHAR);
          IF EOF THEN CALL SYMPUT('N',_N_);
       %DO J=1 %TO &MAXTAPE;
          %IF (&&MINDATA&J LE &M) AND (&M LE &&MAXDATA&J) %THEN %DO;
             CMS FI INDATA&J DISK DATA&M TAPE&J D;
             DATA INDATA&J;
                INFILE INDATA&J MISSOVER;
                VDCALL=&M;
                %DO I=1 %TO &N;
                   INPUT &&V&I &&T&I &&P&I @;
                   LABEL &&V&I="&&L&I";
                %END;
          %END;
       %END;
       DATA INDATA;
          SET
          %DO J=1 %TO &MAXTAPE;
             %IF (&&MINDATA&J LE &M) AND (&M LE &&MAXDATA&J) %THEN INDATA&J;
          %END;
          ;
       PROC SORT;
          BY INVNO PTNO;
       DATA INDATA;
          SET INDATA;
          BY INVNO PTNO;
          IF NOT(FIRST.PTNO) OR NOT(LAST.PTNO) THEN
             PUT INVNO= PTNO= TMPRNO= FIRST.PTNO= LAST.PTNO=;
       DATA SUGI.DATA&M;
          SET INDATA;
    %END;
    %MEND IN;

FIGURE 3

    %MACRO SETDATA;
       SET
       %DO M=0 %TO 12;
          SUGI.DATA&M
       %END;
       ;
    %MEND SETDATA;

    DATA EMROOM (KEEP=INVNO PTNO VDCALL VWFFN VWDTMO VWDTDY VWREAS
                      VWBPH VWHY VWIF VWDDE VWSEQITM);
       %SETDATA
       ***** UNROLL EMERGENCY ROOM LOOP *****;
       ARRAY EMFFN  {*} VWFFN1-VWFFN4;
       ARRAY EMDTMO {*} VWDTMO1-VWDTMO4;
       ARRAY EMDTDY {*} VWDTDY1-VWDTDY4;
       ARRAY EMREAS {*} VWREAS1-VWREAS4;
       ARRAY EMBPH  {*} VWBPH1-VWBPH4;
       ARRAY EMHY   {*} VWHY1-VWHY4;
       ARRAY EMIF   {*} VWIF1-VWIF4;
       ARRAY EMDDE  {*} VWDDE1-VWDDE4;
       DO I = 1 TO 4;
          IF EMREAS{I} NE ' ' THEN DO;
             VWFFN    = EMFFN{I};
             VWDTMO   = EMDTMO{I};
             VWDTDY   = EMDTDY{I};
             VWREAS   = EMREAS{I};
             VWBPH    = EMBPH{I};
             VWHY     = EMHY{I};
             VWIF     = EMIF{I};
             VWDDE    = EMDDE{I};
             VWSEQITM = I;
             OUTPUT EMROOM;
          END;
       END;

    ***** MAKE EMERGENCY ROOM SEGMENT (VW) *****;
    PROC SORT DATA=EMROOM;
       BY INVNO PTNO VDCALL VWSEQITM;
    DATA STUDY.VW;
       SET EMROOM;
       BY INVNO PTNO VDCALL VWSEQITM;
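As a small, self-contained illustration of the array loop in Figure 3, the sketch below breaks out repeated visit fields from one observation per patient call into one observation per visit. It is a simplified example under stated assumptions, not the production code: the input data set name (wide), the limit of two visits per call, the reduced variable list and all data values are hypothetical.

```sas
/* Hypothetical input: one observation per patient call, with up to
   two emergency room visits stored side by side. A period marks a
   missing value for both numeric and character fields in list input. */
data wide;
   input invno ptno vdcall vwreas1 $ vwreas2 $ vwdtmo1 vwdtmo2;
   datalines;
1 101 3 ASTHMA INJURY 4 5
1 102 3 ASTHMA . 4 .
2 201 3 . . . .
;
run;

/* Write one observation per reported visit. VWSEQITM records the
   position of the visit within the call and completes the key. */
data emroom (keep=invno ptno vdcall vwreas vwdtmo vwseqitm);
   set wide;
   array emreas {*} vwreas1-vwreas2;
   array emdtmo {*} vwdtmo1-vwdtmo2;
   do i = 1 to dim(emreas);
      if emreas{i} ne ' ' then do;
         vwreas   = emreas{i};
         vwdtmo   = emdtmo{i};
         vwseqitm = i;
         output;
      end;
   end;
run;

/* Sort by the full key so the segment can be merged with its
   parent and child data sets. */
proc sort data=emroom;
   by invno ptno vdcall vwseqitm;
run;

proc print data=emroom;
run;
```

Using DIM(emreas) for the loop bound, instead of a literal count as in Figure 3, is a design choice made here so that the loop follows the array definition if the number of repeated fields changes.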