Using SAS® Data Sets to Mimic a Relational Database
Greg M. Woolridge, Abbott Laboratories, Abbott Park, IL
Paul R. Coe, PhD, Rosary College, River Forest, IL
Abstract

Although SAS was not originally intended to be used as a database management system (DBMS), it is possible to use SAS data sets to mimic a relational database. The advanced data handling capabilities of SAS permit users to perform most tasks usually associated with a DBMS. By storing data in several SAS data sets which are linked by common variables, end users can manage, update and analyze the data in much the same way they would if the data were stored in a database.

We have implemented this idea to store, manage and analyze survey data. SAS was used for all phases of the project, starting with reading data from flat files and progressing through data entry and correction. This paper will discuss the techniques and present some of the code we used to get the data into SAS data sets and manage it.

Introduction

A database is generally defined as a collection of data which is usually stored on a computer. A relational database contains data which are related to one another in some way. These relationships are defined when the database is designed and form the framework for storing the data. By using common variables in several data sets which act as keys and define the relationships, SAS can be used as a DBMS to design and maintain a database.

Definitions

SAS data sets are commonly used to store data. Two data sets can be linked to one another by including variables which are common to each data set. These variables, known as key variables, serve two purposes: (1) they make each observation in a data set uniquely defined and (2) they link data sets together so information can be easily retrieved. Observations are uniquely defined if each one has a combination of values in the key variables which is different from every other observation. For example, if the key variables are defined to be VARA, VARB and VARC, data set A might have the following values:

Data Set A

OBS  VARA  VARB  VARC
1    A     1     HOSP
2    A     1     ER
3    A     2     HOSP
4    B     1     HOSP

Since every observation has a combination of values which is different from all others, each observation can be defined as unique on these variables.

By including the same variables in two or more data sets, these data sets can be linked. This allows for retrieval of information in a data set based on values in a second data set. As an example of this, consider data set B:

Data Set B

OBS  VARA  VARB  VARD
1    A     1     BP
2    A     2     BP
3    B     1     UA

This data set also contains unique observations, but this time only VARA and VARB are used as keys. Since these variables are also present in data set A, the two can be merged together to produce data set C.

Data Set C

OBS  VARA  VARB  VARC  VARD
1    A     1     HOSP  BP
2    A     1     ER    BP
3    A     2     HOSP  BP
4    B     1     HOSP  UA
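
The merge that produces data set C can be sketched in a few lines. This is a minimal illustration, not code from the project: the data set names A, B and C follow the example, while the values typed in as DATALINES and the choice of a numeric VARB are assumptions.

```sas
/* Data set A: unique on VARA, VARB and VARC (illustrative values) */
data a;
  input vara $ varb varc $;
  datalines;
A 1 HOSP
A 1 ER
A 2 HOSP
B 1 HOSP
;
run;

/* Data set B: unique on VARA and VARB (illustrative values) */
data b;
  input vara $ varb vard $;
  datalines;
A 1 BP
A 2 BP
B 1 UA
;
run;

/* Both inputs must be sorted by the shared key variables */
proc sort data=a; by vara varb; run;
proc sort data=b; by vara varb; run;

/* Match-merge on the common keys; keep every observation from A */
data c;
  merge a (in=ina) b (in=inb);
  by vara varb;
  if ina;
run;

proc print data=c;
run;
```

Because B has at most one observation per VARA/VARB combination, each observation in A picks up the VARD value of its matching key. A key that appeared more than once in both data sets would need more care than a simple match-merge.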

Database Design

We have used this concept of having key variables link data sets at Abbott Laboratories to enter, store and analyze data from a study designed to determine the economic benefit of a drug treatment for a specific disease.

In our study, data was collected through use of a survey of patients in a clinical trial. Patients were followed for 12 months and contacted once a month to determine what health care services they had used during the previous month. Specifically, each patient was asked how many times they had visited a doctor or emergency room and if they had been admitted to a hospital, as well as questions about disabilities and medication use. The reason for each health care event and any tests or procedures performed were also collected. Responses were entered into a standard questionnaire developed and maintained on a computer system. The data was sent to us once a month in the form of flat files on magnetic tape.

We needed to take the data from the survey and merge it with standard cost data to determine a total health care cost for each patient over a 12 month period. Our usual method of entry and storage of data uses the Nomad DBMS. Once data has been entered into Nomad, it is then transferred to SAS data sets, with each data set created representing a single segment in the Nomad database. However, in our case, we needed to have the data ready for analysis in a shorter time frame than was possible using Nomad. We also discovered some limitations in the way our DBMS was set up which would have made it extremely difficult to build the database we needed. So we decided to enter the data directly into SAS data sets, using a database structure which is similar to that which results from a Nomad to SAS transfer.

A complete schematic of the entire database can be found in Figure 1. The top level of the database structure is the Investigator. Since there may be any number of investigators for a study, the number of observations in this data set may vary between databases. Each observation is unique because each investigator is assigned a unique number at our company, stored in the variable INVNO. The study number variable, STUDYNO, is also included in this data set. Therefore, the keys for the Investigator data set are STUDYNO and INVNO. Investigator has only 1 child data set associated with it, Patient. This data set contains all the demographic information on each patient in the study. This includes such things as age, sex, and the patient's initials. Also included is a unique number assigned to each patient as an identifier, PTNO. The keys for this data set are STUDYNO, INVNO and PTNO.

Patient is the parent to 3 different data sets: Study Drug Administration, Visit Data and Premature Terminations. Study Drug Administration and Premature Terminations were not needed for the analysis we intended to do, but were included to make the database match our standard databases. This data was actually collected at the clinical study sites and is not important to our discussion. The Visit Data data set contains information on each call, such as call number, month and day of call, how many ER visits were reported at the call and how many hospitalizations were reported. The keys for this data set are STUDYNO, INVNO, PTNO and the call number, VDCALL. This data set is the parent or grandparent of all remaining data sets. All the remaining data sets contain information collected in the survey and have their own keys associated with them.

The Emergency Room data set contains information about any visits to an emergency room. This includes the date of the visit and the reason for the visit. Data on the procedures performed during the visit also are in this data set. The keys are INVNO, PTNO, VDCALL and VWSEQITM.

Provider Contacts contains all the information on each contact a patient had with a health care provider during the period covered by the call. Providers include MDs, nurses, chiropractors, and even lab technicians. Information in this data set includes such things as the provider's name and specialty, the month and day of the contact and the reason for the contact. The keys in this data set are INVNO, PTNO, VDCALL and TWSEQITM. TWSEQITM is a counter that is incremented by 1 with each new provider contact within a call. This variable was used as a key instead of a provider identifier since a patient may have multiple contacts with the same provider and each contact is recorded separately.

The General Study Procedures data set has 2 parents, Emergency Room and Provider Contacts. This is unusual, but was needed since many of the procedures could apply to either of the parent data sets. The procedures contained in this data set consist of both standard prompts in the survey and free text descriptions. In addition to the parent keys of INVNO, PTNO and VDCALL, there are 2 additional keys in this data set. GPITYP indicates which type of contact, Provider or Emergency Room, the procedures are part of. GPSEQITM is another counter that takes the value of the counter TWSEQITM in Provider Contacts or a similar one, VWSEQITM, in Emergency Room, depending on the value of GPITYP.
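
One way to follow the two-parent link from General Study Procedures back to Provider Contacts is sketched below. Only the key names (INVNO, PTNO, VDCALL, GPITYP, GPSEQITM, TWSEQITM) come from the text. The GPITYP codes 'P' and 'E', the variables GPPROC and TWSPEC, and all data values are hypothetical and are labeled as such in the comments.

```sas
/* Hypothetical procedure records; GPITYP codes P and E are assumed */
data gp;
  input invno ptno vdcall gpityp $ gpseqitm gpproc $;
  datalines;
1 101 3 P 1 XRAY
1 101 3 E 1 EKG
1 101 3 P 2 BLOOD
;
run;

/* Hypothetical provider contacts keyed by the counter TWSEQITM */
data tw;
  input invno ptno vdcall twseqitm twspec $;
  datalines;
1 101 3 1 GP
1 101 3 2 LAB
;
run;

/* Keep provider-type procedures and rename the counter so it
   matches the key name used in the parent data set */
data gp_tw;
  set gp (where=(gpityp='P') rename=(gpseqitm=twseqitm));
run;

proc sort data=gp_tw; by invno ptno vdcall twseqitm; run;
proc sort data=tw;    by invno ptno vdcall twseqitm; run;

/* Attach each procedure to the provider contact it belongs to */
data tw_proc;
  merge tw (in=inparent) gp_tw (in=inchild);
  by invno ptno vdcall twseqitm;
  if inparent and inchild;
run;
```

The same pattern, with the other GPITYP code and a rename to VWSEQITM, would link the remaining procedures to Emergency Room.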

The Hospitalization data set contains the admit and discharge dates, the reason for the admission, any procedures performed during the stay and any surgeries. The keys for this data set are INVNO, PTNO, VDCALL and HPSEQITM.

The Other Medications data set contains the name and dosing information for all prescription medications the patient took during the study. The keys for this data set are INVNO, PTNO, VDCALL, OMTYPE and OMSEQITM.

The Home Care segment contains information on assistance a patient received in their home from both paid and unpaid sources. The data contained are type of helper, number of days of care and the reason for the care. The keys for this data set are INVNO, PTNO and VDCALL.

The Lost Work Days, Decreased Daily Activities and Bed Rest segments all contain information on when a patient was unable to participate in normal activities due to a medical condition. Each data set contains the reason and the number of days the patient's activities were affected. The keys for each of these data sets are INVNO, PTNO and VDCALL.

The final data set, Reason Flags, is the link between all other data sets which contain health care utilization, or what we call events. All segments below Visit Data, except Reason Flags, contain events as well as a reason for that event. Reason Flags attempts to link those segments together chronologically rather than hierarchically. This was one of the factors we considered when we decided to build our database in SAS instead of our usual DBMS. The databases we routinely build are structured in a strictly hierarchical manner. Since the Reason Flags segment links together several segments, we were unable to build that segment into a database using our standard DBMS.

For every event mentioned by a patient during a telephone call, a reason for the event was collected and a list of reasons for all events was compiled. Each reason was assigned a number to identify it within that call. Then each time a patient mentioned a new event, a list of the previously mentioned reasons was presented and the patient was asked if the current event was due to one of the previously mentioned reasons. If it was, then the number of the previously mentioned reason was inserted in the data set to tie that event back to an observation in the Reason Flags data set. If the event was due to a new reason, the reason was inserted in the event data set instead of a number. The new reason was also placed in the Reason Flags data set and given the next consecutive number so it could be used for other events. Using this method allows all the events for a single reason to be tied together, even if they are in separate data sets. This provides an excellent example of how SAS can be used to mimic a relational database by using common variables in multiple data sets to link the data sets together. The keys in Reason Flags are INVNO, PTNO, VDCALL and WYRNO, the reason number.

From Flat File to SAS Data Set

The data was received in a series of flat files on tape and several tapes were received over time. Each tape contained the calls made since the previous tape. Since patients were accrued over time, at any point in time not all patients would have had the same number of calls. Therefore, the calls were grouped into files by the number of the call. For example, when a tape was sent, the fifth call made to any patient in the time frame covered by that tape was put into a DATA5 file on the tape. The same was done for any other call number which might have been made during the time frame. This meant that we might not always receive the same files on each tape. If no baseline calls (call number 0) were made during the time frame, then there would be no DATA0 file on the tape. When the files were copied onto a disk, a separate library was created for each tape received since we usually had DATA files of the same number on more than one tape.

The first step when we received a tape was to get the data into an intermediate SAS data set. This was done with a DATA step using an INFILE statement. Once the layout of the flat tape file is known, this is a relatively simple step. In our case, there were 3 different layouts possible, depending on the telephone call number. These layouts were sent in separate flat files on the tape each month. We took the first set of layout files we received and modified them to add a variable name for each field. Each time a new tape was received, the new layout files were compared with the modified files to determine if any changes had been made to the layouts. The modified layout files were then used as input into the program creating the intermediate SAS data sets to create the INPUT statements. The code to accomplish this is presented in Figure 2.

Looking at the code in Figure 2 you can see that macro processing is used extensively. Four macro variables are created with %LET statements (not shown here) and are defined as:

MAXDATA  - the largest DATAx file to be read
MAXTAPE  - the number of tapes being used
MINDATAx - the first DATAx file to be read in a specific tape library
MAXDATAx - the last DATAx file to be read in a specific tape library
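
Since the %LET statements themselves are not shown, the following is only a guess at their shape. It assumes three tapes and invented call ranges; the macro variable names follow the definitions above, with the x in MINDATAx and MAXDATAx taken to be the tape number.

```sas
/* Hypothetical values: three tapes holding calls 0 through 8 */
%let maxdata  = 8;    /* largest DATAx file to be read        */
%let maxtape  = 3;    /* number of tapes being used           */

%let mindata1 = 0;    /* first DATAx file in tape library 1   */
%let maxdata1 = 2;    /* last DATAx file in tape library 1    */
%let mindata2 = 0;    /* first DATAx file in tape library 2   */
%let maxdata2 = 5;    /* last DATAx file in tape library 2    */
%let mindata3 = 1;    /* first DATAx file in tape library 3   */
%let maxdata3 = 8;    /* last DATAx file in tape library 3    */

/* Indirect reference: &&MINDATA&J resolves first to &MINDATA2
   and then to the value stored for tape 2 */
%let j = 2;
%put Tape &j holds DATA&&mindata&j through DATA&&maxdata&j;
```

With these values the %PUT statement writes "Tape 2 holds DATA0 through DATA5" to the log. The double ampersand is what lets a loop over tape numbers pick up the range for each tape in turn.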

Each tape received has a number of files on it, one file for each set of phone calls made the previous month. The files are designated DATA0 - DATA12, with 0 being a baseline call. The files are copied from the tape into a library on the VM mainframe by using a file type of TAPEx, where x indicates which tape it is (i.e. first tape, x=1; second tape, x=2; etc.).

The code uses a %DO-%TO loop to process each DATAx file individually. The macro first decides which layout file, VARMAP0, VARMAP1 or VARMAP3, is appropriate to use for processing each flat file and sets a macro variable, FORM, that can be used in the rest of the code. The information from the layout file is then read into a SAS data set. A DATA _NULL_ step is used to create a series of macro variables which contain the variable name, type, position and label for the data to be read from each DATAx file.

The next step is to read the DATAx file and create a SAS data set with 1 observation for each patient. This is done for each tape received, and then all the files for each type of call are brought together into a single SAS data set. Since each patient is being called only once per month, the end result should be a series of SAS data sets, one for each call, each containing 1 observation per patient. The final data sets are named SUGI.DATA&M, where &M is the number of the phone call.

The code in Figure 3 is a partial listing of the code that takes the SUGI.DATA&M data sets and makes the individual data sets that make up the database. The macro SETDATA makes it easier to deal with varying numbers of input data sets. Since some of the earlier tapes we received did not have calls all the way through call 12, we did not always have 13 input data sets. Using macro code makes it easier to deal with this situation since you need only make 1 change near the top of the program. The DATA step where EMROOM is created sets all of the SUGI.DATA&M data sets together. Since the data sets have multiple Emergency Room visits per observation, these need to be broken out into separate observations in the database. The DO - TO loop does this, using array processing to create the variables for the output data set and then using an OUTPUT statement at the bottom of the loop. Once this is done, the data set is sorted by the key variables and a permanent data set, STUDY.VW, is created. In the actual code we used, all of the data sets for the database were created in the same DATA step using code similar to the code for Emergency Room.

Once all of the individual data sets were created, we started to code the procedures. What we wanted was a standard CPT4 code for each procedure, both prompted and free text, so standard costs could be applied. The prompted procedures were easy to code. In order to code the free text procedures, a list of all procedures was printed and submitted to our Medical department. They assigned 1-6 codes for each procedure and returned the list to us.

At the same time the list was generated for our Medical department, a SAS data set was created which contained all the procedures and variables to hold all the codes. When the list was returned, the codes were entered into the variables using data entry screens developed with SAS/AF. When the data was analyzed, the file with the codes was merged with the database by the procedure so the CPT4 codes were available.

Conclusion

By storing data in several SAS data sets which are linked by common variables, end users can manage, update and analyze the data in much the same way they would if the data were stored in a database. We have created such a database at our company. By merging costs associated with events stored in the database, our end users are able to analyze the economic benefits associated with the drug treatment. SAS was used for all phases of the project, starting with reading data from flat files and progressing through data entry and correction. SAS also allowed us to define relationships in the data which could not be easily defined using our standard DBMS.

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies.

FIGURE 1
SCHEMATIC OF DATABASE

INVESTIGATOR
    PATIENT
        STUDY DRUG ADMINISTRATION
        VISIT DATA
            HOSPITALIZATION
            EMERGENCY ROOM
            PROVIDER CONTACTS
                GENERAL STUDY PROCEDURES (child of both EMERGENCY ROOM and PROVIDER CONTACTS)
            HOME CARE
            LOST WORK DAYS
            DECREASED DAILY ACTIVITIES
            BED REST
            OTHER MEDICATIONS
            REASON FLAGS
        PREMATURE TERMINATIONS

FIGURE 2

%MACRO IN;
  %DO M=0 %TO &MAXDATA;
    /* Choose the layout file that matches the call number */
    %IF &M=0 %THEN %LET FORM=VARMAP0;
    %ELSE %IF &M=1 OR &M=2 %THEN %LET FORM=VARMAP1;
    %ELSE %LET FORM=VARMAP3;
    CMS FI FILFORM DISK &FORM SAS A;
    /* Read the layout file into a SAS data set */
    DATA FILFORM;
      INFILE FILFORM MISSOVER;
      LENGTH POS $9;
      INPUT VAR $ POS $ LEN $ TYPE $ SASVAR $;
    /* One set of macro variables per field: name, label, position, type */
    DATA _NULL_;
      SET FILFORM END=EOF;
      IF TYPE='A' THEN CHAR='$';
      CALL SYMPUT('V'||TRIM(LEFT(_N_)),SASVAR);
      CALL SYMPUT('L'||TRIM(LEFT(_N_)),VAR);
      CALL SYMPUT('P'||TRIM(LEFT(_N_)),POS);
      CALL SYMPUT('T'||TRIM(LEFT(_N_)),CHAR);
      IF EOF THEN CALL SYMPUT('N',_N_);
    /* Read the DATA&M file from every tape library that holds one */
    %DO J=1 %TO &MAXTAPE;
      %IF (&&MINDATA&J LE &M) AND (&M LE &&MAXDATA&J) %THEN %DO;
        CMS FI INDATA&J DISK DATA&M TAPE&J D;
        DATA INDATA&J;
          INFILE INDATA&J MISSOVER;
          VDCALL=&M;
          %DO I=1 %TO &N;
            INPUT &&V&I &&T&I &&P&I @;
            LABEL &&V&I="&&L&I";
          %END;
      %END;
    %END;
    /* Bring the files for this call together from all tapes */
    DATA INDATA;
      SET
      %DO J=1 %TO &MAXTAPE;
        %IF (&&MINDATA&J LE &M) AND (&M LE &&MAXDATA&J) %THEN INDATA&J;
      %END;
      ;  /* closes the SET statement */
    PROC SORT;
      BY INVNO PTNO;
    /* Report any patient with more than 1 observation for this call */
    DATA INDATA;
      SET INDATA;
      BY INVNO PTNO;
      IF NOT(FIRST.PTNO) OR NOT(LAST.PTNO) THEN
        PUT INVNO= PTNO= PTMPRNO= FIRST.PTNO= LAST.PTNO=;
    DATA SUGI.DATA&M;
      SET INDATA;
  %END;
%MEND IN;
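
The idea behind Figure 2, building INPUT statements from a layout table, can be tried on a small scale. The sketch below is not the authors' program. It assumes a current SAS release and swaps in CALL SYMPUTX, a temporary FILENAME and TRUNCOVER in place of the CMS file definitions and MISSOVER used above. The layout, the field positions and the data lines are all invented for illustration.

```sas
/* Hypothetical layout: variable name, column range, type (A = character) */
data layout;
  length sasvar $8 pos $9 type $1;
  input sasvar $ pos $ type $;
  datalines;
INVNO 1-3 N
PTNO 5-8 N
VWREAS 10-17 A
;
run;

/* One macro variable per field (V1, T1, P1, ...) plus N, the field count */
data _null_;
  set layout end=eof;
  length char $1;
  if type = 'A' then char = '$';
  else char = ' ';
  call symputx(cats('V', _n_), sasvar);
  call symputx(cats('T', _n_), char);
  call symputx(cats('P', _n_), pos);
  if eof then call symputx('N', _n_);
run;

/* Write a small flat file to a temporary location (invented records) */
filename raw temp;
data _null_;
  file raw;
  put '001 0101 CHEST PN';
  put '001 0102 ASTHMA';
run;

/* Generate one INPUT statement per field from the macro variables */
%macro readflat;
  data flat;
    infile raw truncover;
    %do i = 1 %to &n;
      input &&v&i &&t&i &&p&i @;
    %end;
  run;
%mend readflat;
%readflat

proc print data=flat;
run;
```

TRUNCOVER is used here because the generated records vary in length. With fixed-length tape records, as in the original setting, MISSOVER behaves the same way.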

FIGURE 3

%MACRO SETDATA;
  SET
  %DO M=0 %TO 12;
    SUGI.DATA&M
  %END;
  ;
%MEND SETDATA;

DATA EMROOM (KEEP=INVNO PTNO VDCALL VWFFN VWDTMO VWDTDY VWREAS VWBPH
                  VWHY VWIF VWDDE VWSEQITM);
  %SETDATA
  ***** UNROLL EMERGENCY ROOM LOOP *****;
  ARRAY EMFFN  {*} VWFFN1-VWFFN4;
  ARRAY EMDTMO {*} VWDTMO1-VWDTMO4;
  ARRAY EMDTDY {*} VWDTDY1-VWDTDY4;
  ARRAY EMREAS {*} VWREAS1-VWREAS4;
  ARRAY EMBPH  {*} VWBPH1-VWBPH4;
  ARRAY EMHY   {*} VWHY1-VWHY4;
  ARRAY EMIF   {*} VWIF1-VWIF4;
  ARRAY EMDDE  {*} VWDDE1-VWDDE4;
  DO I = 1 TO 4;
    IF EMREAS{I} NE ' ' THEN DO;
      VWFFN  = EMFFN{I};
      VWDTMO = EMDTMO{I};
      VWDTDY = EMDTDY{I};
      VWREAS = EMREAS{I};
      VWBPH  = EMBPH{I};
      VWHY   = EMHY{I};
      VWIF   = EMIF{I};
      VWDDE  = EMDDE{I};
      VWSEQITM = I;
      OUTPUT EMROOM;
    END;
  END;

***** MAKE EMERGENCY ROOM SEGMENT (VW) *****;
PROC SORT DATA=EMROOM;
  BY INVNO PTNO VDCALL VWSEQITM;
DATA STUDY.VW;
  SET EMROOM;
  BY INVNO PTNO VDCALL VWSEQITM;
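
The array technique in Figure 3 can be seen in isolation with a cut-down example. This is a sketch under stated assumptions: the data set WIDE, its three VWREAS slots and every value are invented. Only the variable naming pattern and the use of the loop index as VWSEQITM follow the figure.

```sas
/* Hypothetical wide records: up to 3 ER visit reasons per call */
data wide;
  infile datalines truncover;
  input invno ptno vdcall vwreas1 $ vwreas2 $ vwreas3 $;
  datalines;
1 101 1 ASTHMA FALL .
1 102 1 . . .
1 103 2 CHEST . .
;
run;

/* Write one observation per non-missing visit; the loop index
   becomes the sequence key VWSEQITM */
data long (keep=invno ptno vdcall vwreas vwseqitm);
  set wide;
  length vwreas $8;
  array emreas {*} vwreas1-vwreas3;
  do i = 1 to dim(emreas);
    if emreas{i} ne ' ' then do;
      vwreas   = emreas{i};
      vwseqitm = i;
      output;
    end;
  end;
run;

/* Sort by the full key so the result is ready for BY-group merges */
proc sort data=long;
  by invno ptno vdcall vwseqitm;
run;

proc print data=long;
run;
```

Patient 102 reports no visits and so contributes no observations, which is how the unrolled data set ends up holding only real events.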