CONSTRUCTION AND MANAGEMENT OF A MEDICAL RESEARCH DATABASE USING RAQL AND SAS
Ted G. Van Rossum
Jacob V. Aranda
McGill University - Montreal Children's Hospital Research Institute

INTRODUCTION

The Developmental Pharmacology and Perinatal Research Unit (DPPRU) at the Montreal Children's Hospital, supported by a grant from Health and Welfare Canada, undertook an intensive prospective study on the epidemiology of drug utilization and adverse drug reactions in the newborn infant. The DPPRU monitored 1200 babies in the neonatal intensive care unit over a 5 year period, recording patient and mother history, medications given, lab tests taken, feedings, intravenous solutions, and physical examinations. The data, totaling over 200,000 records stored in 5 large SAS files, were designed to be an ongoing source of information for the determination of the incidence, types, patterns and factors influencing drug utilization and adverse drug reactions.

The statistical analysis focused on the calculation of crude incidence rates and relative risk factors for toxicity to the newborn's sensitive organs when certain drugs were given. This type of analysis required that, for each drug under study, the survey population be divided into 3 sub-populations: study, control, and exclusion. Since the criteria defining each sub-population were different for each drug under study, it was recognized that a powerful and flexible method for sub-population retrieval was essential.

Given these data management requirements, it was decided that a relational database system, because of its flexibility, simplicity and power, would be optimal. Since SAS was chosen to handle the statistical analysis, RAQL, a new relational query language embedded in SAS, was chosen to perform the sub-population retrieval. Together, SAS and RAQL provided a comprehensive relational database management system.

This paper, using examples from our study, will cover three important topics. The first topic is a discussion in general terms of the power of the relational approach to data management. The second topic is the steps and issues involved in creating a relational database. In our study, for example, we converted the original data, stored in 5 large unwieldy files, into 3/ SAS datasets to produce a simple relational database. The third topic is the use of RAQL for the management of the relational database. Many of the advantages of using RAQL, including increased accuracy and productivity, are illustrated with examples from our research. It is estimated that an order of magnitude productivity increase was achieved in our study through the use of RAQL and SAS.

ADVANTAGES OF THE RELATIONAL APPROACH TO DATA MANAGEMENT

There are three main approaches to database design: hierarchical, network, and relational. In a research environment where flexibility of data retrieval is the main consideration, only the relational approach ensures that interrelationships between any two variables in the database can be easily assessed. The simplicity of the relational data model, a set of relations (i.e. SAS datasets), also makes the relational approach very appealing. No complex set of pointers is required to maintain the interrelationships between variables, as is the case with the other two approaches.

The most attractive feature of the relational approach is that it enables the user to interact with his data at a much higher level of abstraction through the use of the relational operators. These operators, based on mathematical set theory, operate on whole relations at a time, producing a new relation as a result. This is analogous to the higher level of abstraction obtained when using a high level programming language rather than an assembler language to perform mathematical operations. That is, the user can think in mathematical terms and not be concerned with the tedious mechanics used to implement them. This high level approach results in a clear, concise and systematic methodology for data retrieval.

CREATING A RELATIONAL DATABASE

Commercial database systems generally operate online, in real time, and in a multi-user environment where updates, additions and deletions are the order of the day. This type of environment makes commercial databases very complex to design and maintain. Research databases, on the other hand, are easy to design because, firstly, real time operation is not needed and, secondly, research data is generally static (i.e. no updates are required once the database is properly constructed). The elimination of these two performance characteristics enables the designer to focus on the prime design criterion of a research database, data accessibility.

The steps involved in achieving maximum data accessibility are as follows:
- grouping the data by meaning into relations,
- selection of keys,
- normalization of the relation (data model simplification),
- and finally, permanent sort orders.

STEP 1 : GROUP THE DATA BY MEANING INTO RELATIONS

To convert the raw data into a relational database, the variables must first be grouped into a number of relations (a relation can be thought of as a SAS dataset with no duplicate observations). The prime goal in grouping variables is the creation of a set of relations, each of which has a certain meaning. That is, the aggregation of variables in a relation should aptly describe a single entity, event, concept or function.

To produce a database which is easy to understand and use, relations should have the simplest possible meaning. These elementary relations can then be used as building blocks to be manipulated by relational operators to form other relations describing more complex entities.

Figure 1 illustrates the results of the first transformation of the original study variables into 4 relations with simple meaning. It should be emphasized that this transformation is only the first step in obtaining relations with the simplest meaning possible. Step 3, normalization, further pursues this matter.

STEP 2 : SELECTION OF KEY VARIABLES

One or more variables, known as the key of the relation, must be selected which will uniquely identify each observation in the relation. Selecting a unique key may not always be obvious. For example, in REL2.0 in figure 1, all biochemistry tests for the same patient on the same day could have the same results, thereby producing a duplicate observation. Since relations should not have duplicate observations, the SEQUENCE NUMBER variable was added to the relation to ensure uniqueness.

The selection of key variables has an important role to play in determining the flexibility with which data is accessed. To ensure that the user can access any combination of variables in the database, he must be able to join any two relations in the database using their keys, either directly or through a number of intermediate joins.

STEP 3 : NORMALIZATION

Normalization is the process whereby excess variables are removed from a relation in order to simplify the data and thus produce a better representation of the real world. Five levels of normalization can be pursued; however, for the realm of static databases (i.e. no updates, additions or deletions are made to the database), only the first level of normalization is essential since it ensures basic data accessibility. The second and third levels of normalization, though not essential, are used to further refine the database, resulting in a simpler data model and increased data accessibility. The fourth and fifth levels have little bearing on static databases and so are not discussed here.

BASIC DATA ACCESSIBILITY : FIRST NORMAL FORM

A relation is said to be in first normal form if all of its variables are simple. That is, each variable is a single item and not another relation or group of variables. For SAS datasets, this condition is basically satisfied.

A more subtle form of non-simple variable to be guarded against is the repeating field. A repeating field is defined as a number of variables with identical meaning which, when taken together, form a group. REL1.0 in figure 1 includes an example of a repeating field: the 3 identical variables MEDICATION DURING LABOUR. These variables taken together form the group of all the medications given during labour. It was necessary, therefore, to place this group into a separate relation, REL1.1 in figure 2.

DATA MODEL REFINEMENT AND INCREASED ACCESSIBILITY : SECOND AND THIRD NORMAL FORMS

The next step is to ensure that the relation is in second normal form. That is, ensure every non-key variable is functionally dependent on the full key. The term functionally dependent is defined as follows. Given two variables PATIENT ID and BIRTHDATE, for example, we can say that BIRTHDATE is functionally dependent on PATIENT ID if and only if for each value of PATIENT ID there is associated with it only one value of BIRTHDATE. More simply, a PATIENT ID can have only one corresponding BIRTHDATE. A patient's ADMISSION DATE, on the other hand, is not functionally dependent on PATIENT ID. A patient can be re-admitted with the same PATIENT ID, therefore resulting in more than one ADMISSION DATE associated with the PATIENT ID.

Based on the above example, REL1.2 in figure 2 (data recorded on admission) has a 2 variable key, PATIENT ID and ADMISSION DATE. Of the remaining variables in this relation, only DISCHARGE DATE is dependent on the two variables of the key. REL1.2 was therefore split into two relations as illustrated in figure 3. REL1.2.1 contains the admission and discharge data, while REL1.2.2 contains the patient biographical data. It should be noted that this transformation has also made the meaning of the relations simpler.

A relation is in third normal form if, ignoring the key variables, none of the remaining variables are functionally dependent on each other. For example, in REL1.2.2 in figure 3, BLOOD TYPE MOTHER and BIRTHDATE MOTHER are both dependent on MOTHER NAME. A simpler data model is obtained if this relationship is isolated and placed in a relation of its own. Figure 4 illustrates the new relations, REL1.2.2.1, patient biographical data, and REL1.2.2.2, mother's biographical data.

STEP 4 : PERMANENT SORT ORDER

Most implementations of relational operations, including RAQL, require that copies of the operand relations be sorted over the variables named in the operation. As a practical matter, the database designer can reduce the high cost of sorting by choosing a permanent sort order for each relation, based on the variables most often named in a relational operation. In our database, the permanent sort orders were the full key in each relation.

MANAGING A RELATIONAL DATABASE WITH RAQL

Having processed the data into a simple relational database format, the SAS user is now ready to systematically manage data retrieval tasks using RAQL, a new high level relational query language. RAQL statements placed in a SAS program provide the user with the full set of relational operators, thus enabling both data retrieval and analysis in the same program. This synergy of RAQL and SAS provides the user with a concise and systematic methodology for data retrieval.

The following section will illustrate the power of RAQL relational operators through some examples of data retrieval problems experienced at the DPPRU. Though relational operations can be programmed directly in SAS, it was estimated that in our study an order of magnitude reduction in program design, coding and testing was achieved through the use of RAQL for sub-population retrieval.

RAQL EXAMPLES

The following illustrates the use of 4 RAQL operators, SELECT, PROJECT, JOIN, and MINUS. A detailed description of all the relational operators can be found in AN INTRODUCTION TO DATABASE SYSTEMS and in the RAQL USER MANUAL.

The SELECT operator creates a new result relation by keeping only those observations in the original operand relation which meet the user specified conditions (this is similar to the SAS subsetting IF). For example, to create a new relation with the biographical data of only male babies, the following RAQL statement would be used.

   MALE BABIES =
      SELECT REL1.2.2.1
      WHERE SEX = 'MALE'

The PROJECT operator creates a new relation from the original by keeping only those variables of the operand relation given in the variable list. An important feature of the PROJECT operation is that duplicate observations in the new relation are eliminated. For example, to retrieve a list of all the different drugs recorded in the survey, the following RAQL statement would be used.

   ALL DIFFERENT DRUGS =
      PROJECT REL3.0
      OVER DRUG NAME.

The concept of a JOIN is close to that of a SAS MERGE; a new relation is created by combining two operand relations. The variables of the result relation are a combination of the variables from the two operand relations (same as a SAS MERGE). An observation for the result relation is generated in each instance where the values of common variables are equal. Observations where common values cannot be matched are not included. The JOIN is more general than the SAS MERGE since it properly handles the case where there are duplicate values of the BY variable in both data sets.

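The behaviour of the SELECT, PROJECT and JOIN operators described in this paper can be approximated outside SAS with a short Python sketch. This is only an illustration under stated assumptions, not RAQL's implementation: relations are modelled as lists of dictionaries, the helper names `select`, `project` and `join` are hypothetical, the variable names are written with underscores, and the sample observations are invented. The join uses a simple nested loop rather than the sort-based approach mentioned above.

```python
# Illustrative sketch only: a relation is a list of dicts, one dict per observation.
# The helper names and the sample data are hypothetical, not part of RAQL or SAS.

def select(relation, predicate):
    """Keep only the observations that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, variables):
    """Keep only the named variables and eliminate duplicate observations."""
    seen = set()
    result = []
    for row in relation:
        values = tuple(row[v] for v in variables)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(variables, values)))
    return result


def join(left, right, over):
    """Pair every left and right observation whose `over` variables are equal.

    Unmatched observations are dropped, and duplicate key values on both
    sides produce every matching combination.
    """
    result = []
    for left_row in left:
        for right_row in right:
            if all(left_row[v] == right_row[v] for v in over):
                result.append({**left_row, **right_row})
    return result


# Invented sample observations.
medications = [
    {"PATIENT_ID": 1, "DRUG_NAME": "GENTAMYCIN"},
    {"PATIENT_ID": 1, "DRUG_NAME": "AMPICILLIN"},
    {"PATIENT_ID": 2, "DRUG_NAME": "GENTAMYCIN"},
]
diagnoses = [
    {"PATIENT_ID": 1, "DIAGNOSIS": "JAUNDICE"},
    {"PATIENT_ID": 1, "DIAGNOSIS": "SEPSIS"},
    {"PATIENT_ID": 3, "DIAGNOSIS": "JAUNDICE"},
]

gentamycin_rows = select(medications, lambda row: row["DRUG_NAME"] == "GENTAMYCIN")
all_different_drugs = project(medications, ["DRUG_NAME"])
diagnosis_and_medication = join(diagnoses, medications, ["PATIENT_ID"])
```

In this sample, patient 1 has two diagnoses and two medications, so the join yields all four combinations for that patient, which is the duplicate-key case the text contrasts with a SAS MERGE. Patients 2 and 3 appear in only one relation and are therefore absent from the result.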
An example of the use of the JOIN is as follows: to correlate MEDICATION DURING LABOUR from REL1.1 in figure 2 and patient DIAGNOSIS from REL4.0 in figure 1, a new relation containing both these variables must first be created. Since these 2 relations have the variable PATIENT ID in common, the following RAQL statement can be used to produce the required relation.

   DIAGNOSIS AND MEDICATION =
      JOIN REL4.0 AND REL1.1
      OVER PATIENT ID.

The MINUS operator is like the JOIN in that it uses 2 operand relations to produce a result relation. The variables in the result relation are those of the first operand relation only. The MINUS operator produces a new relation by "subtracting" or eliminating from the first operand relation all observations having variable values in common with the second operand relation.

For example, to create a relation with the biographical data of all patients who did not receive any medication while in hospital, the following RAQL statement can be used. This operation, in effect, subtracts the "medications given" relation (REL3.0 in figure 1) from the "biographical data" relation, using PATIENT ID as the common variable.

   NO MEDS RECEIVED =
      REL1.2.2.1 MINUS REL3.0
      OVER PATIENT ID.

EXAMPLE DATABASE QUERY

Most queries to the database involved many steps, the first and most important of which is the definition of sub-populations. A sub-population is defined by creating a relation which contains a list of all the patients fitting the sub-population description. The following example illustrates a typical sub-population definition step.

To test the hypothesis that gentamycin produces kidney failure in the newborn, the following sub-populations had to be defined.

a) EXCLUSION POPULATION: those patients who had chronic kidney failure, and should therefore be excluded from the analysis.
b) STUDY POPULATION: those patients who did not have chronic kidney failure and were given the drug gentamycin.
c) CONTROL POPULATION: those patients who did not have chronic kidney failure and did not receive gentamycin.

Figure 5 illustrates a RAQL program to define the above populations. It should be noted that the RAQL program to define these sub-populations required only 8 statements; a SAS program to perform the same task would require from 1+ to 10 times more statements.

CONCLUSION

Simplification of the data into the relational database format and the power of the relational operators combine to provide the user with a powerful set of tools. In the case of the DPPRU at the Montreal Children's Hospital, we found a great increase in productivity and, more importantly, a concise methodology which allowed us to easily retrieve the various sub-populations required.

ACKNOWLEDGEMENTS

The authors would like to thank Michael Gilman for the invaluable assistance in preparing this paper.

AUTHOR CONTACT

Ted Van Rossum
Developmental Pharmacology and Perinatal Research Unit, Rm. A-604
Montreal Children's Hospital
2300 Tupper St., Montreal, QUE.
Canada H3H 1P3

REFERENCES

1. Bragg, A. W. "Nonprocedural Query Facility For The Casual SAS User", SUGI Conference Proceedings, 1981.
2. Burrage, D. and Gilman, M. "RAQL - An Evolution in SAS Data Management", SUGI Conference Proceedings, 1983.
3. Cardenas, A. F. Database Management Systems, Allyn and Bacon Inc., Boston 1979.
4. Date, C. J. An Introduction To Database Systems, 3rd edition, Addison-Wesley, 1981.
5. Gilman, M. and Burrage, D. RAQL User Manual, McGill University, Montreal, Canada 1982.
6. Merrett, T. H. A Relational Information System, Reston, 1983.

FIGURE 1
EXAMPLES OF RELATIONS GROUPED BY MEANING

RELATION  : REL1.0
MEANING   : Data recorded on each admission to the neonatal intensive care unit.
VARIABLES :
   * PATIENT ID
   * ADMISSION DATE
     DISCHARGE DATE
     NAME BABY
     BIRTHDATE BABY
     SEX
     GESTATION AGE
     BIRTH WEIGHT
     MEDICATION DURING LABOUR #1
     MEDICATION DURING LABOUR #2
     MEDICATION DURING LABOUR #3
     BLOOD TYPE BABY
     NAME MOTHER
     BIRTHDATE MOTHER
     BLOOD TYPE MOTHER

RELATION  : REL2.0
MEANING   : Biochemistry blood test result
VARIABLES :
   * PATIENT ID
   * DATE
   * SEQUENCE NUMBER
     BUN
     CREATINE

RELATION  : REL3.0
MEANING   : Medications given.
VARIABLES :
   * PATIENT ID
   * DRUG NAME
   * START DATE
   * STOP DATE
     DOSE

RELATION  : REL4.0
MEANING   : Patient diagnoses.
VARIABLES :
   * PATIENT ID
   * DATE
   * DIAGNOSIS

NOTE : Asterisk denotes key variables.

FIGURE 2

RELATION  : REL1.1
MEANING   : Medications given during labour
VARIABLES :
   * PATIENT ID
   * MEDICATION DURING LABOUR

RELATION  : REL1.2
MEANING   : Data recorded on each admission to the neonatal intensive care unit.
VARIABLES :
   * PATIENT ID
   * ADMISSION DATE
     DISCHARGE DATE
     NAME BABY
     BIRTHDATE BABY
     SEX
     GESTATION AGE
     BIRTH WEIGHT
     BLOOD TYPE BABY
     NAME MOTHER
     BLOOD TYPE MOTHER
     BIRTHDATE MOTHER

NOTE : Asterisk denotes key variables.

FIGURE 3
TRANSFORM REL1.2 INTO SECOND NORMAL FORM

RELATION  : REL1.2.1
MEANING   : Patient stay.
VARIABLES :
   * PATIENT ID
   * ADMISSION DATE
     DISCHARGE DATE

RELATION  : REL1.2.2
MEANING   : Patient's biographical data.
VARIABLES :
   * PATIENT ID
     NAME BABY
     BIRTHDATE BABY
     SEX
     GESTATION AGE
     BIRTH WEIGHT
     BLOOD TYPE BABY
     NAME MOTHER
     BLOOD TYPE MOTHER
     BIRTHDATE MOTHER

NOTE : Asterisk denotes key variables.

FIGURE 4
TRANSFORM REL1.2.2 INTO THIRD NORMAL FORM

RELATION  : REL1.2.2.1
MEANING   : Patient's biographical data
VARIABLES :
   * PATIENT ID
     NAME BABY
     BIRTHDATE BABY
     SEX
     GESTATION AGE
     BIRTH WEIGHT
     BLOOD TYPE BABY
     NAME MOTHER

RELATION  : REL1.2.2.2
MEANING   : Mother's biographical data
VARIABLES :
   * NAME MOTHER
     BLOOD TYPE MOTHER
     BIRTHDATE MOTHER

NOTE : Asterisk denotes key variables.

FIGURE 5
RAQL EXAMPLE, CREATE STUDY AND CONTROL POPULATIONS

*ALL COMMENTS START WITH AN ASTERISK

*CREATE EXCLUSION POPULATION (I.E.
*PATIENTS WITH CHRONIC KIDNEY FAILURE)

CHRONIC KIDNEY FAILURE PATIENTS =
   SELECT REL4.0
   WHERE DIAGNOSIS = 'CHRONIC KIDNEY FAILURE'.

EXCLUSION POPULATION =
   PROJECT CHRONIC KIDNEY FAILURE PATIENTS
   OVER PATIENT ID.

*CREATE THE COMBINED STUDY PLUS CONTROL
*POPULATIONS BY ELIMINATING THE EXCLUSION
*POPULATION FROM FURTHER CONSIDERATION
*NOTE: REL1.2.2 IS BIOGRAPHICAL DATA
*      FOR EACH PATIENT

NON CHRONIC KIDNEY FAILURE PATIENTS =
   REL1.2.2 MINUS EXCLUSION POPULATION
   OVER PATIENT ID.

NON CHRONIC KIDNEY FAILURE POPULATION =
   PROJECT NON CHRONIC KIDNEY FAILURE PATIENTS
   OVER PATIENT ID.

*CREATE POPULATION OF PATIENTS WHO
*RECEIVED GENTAMYCIN

PATIENTS WHO RECEIVED GENTAMYCIN =
   SELECT REL3.0
   WHERE DRUG NAME = 'GENTAMYCIN'.

ALL GENTAMYCIN POPULATION =
   PROJECT PATIENTS WHO RECEIVED GENTAMYCIN
   OVER PATIENT ID.

*CREATE STUDY POPULATION

STUDY POPULATION =
   JOIN ALL GENTAMYCIN POPULATION
   AND NON CHRONIC KIDNEY FAILURE POPULATION
   OVER PATIENT ID.

*CREATE CONTROL POPULATION

CONTROL POPULATION =
   NON CHRONIC KIDNEY FAILURE POPULATION
   MINUS STUDY POPULATION
   OVER PATIENT ID.
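
The logic of the Figure 5 program can also be traced with ordinary set operations. The Python sketch below is an illustration under stated assumptions rather than a translation of RAQL: the sample patients, diagnoses and medications are invented, each sub-population is reduced to a set of PATIENT ID values, and the final JOIN is written as a set intersection, which is equivalent only because both operands contain the single variable PATIENT ID.

```python
# Illustrative sketch of the Figure 5 sub-population logic using Python sets.
# All data values below are invented for demonstration purposes.

# Stand-in for REL4.0 (patient diagnoses): (PATIENT ID, DIAGNOSIS)
diagnoses = [
    (1, "CHRONIC KIDNEY FAILURE"),
    (2, "JAUNDICE"),
    (3, "SEPSIS"),
]

# Stand-in for REL3.0 (medications given): (PATIENT ID, DRUG NAME)
medications = [
    (1, "GENTAMYCIN"),
    (2, "GENTAMYCIN"),
    (2, "AMPICILLIN"),
    (4, "AMPICILLIN"),
]

# Stand-in for the PATIENT ID values of REL1.2.2 (biographical data).
all_patients = {1, 2, 3, 4}

# SELECT on DIAGNOSIS, then PROJECT over PATIENT ID.
exclusion_population = {
    patient_id
    for patient_id, diagnosis in diagnoses
    if diagnosis == "CHRONIC KIDNEY FAILURE"
}

# MINUS: remove the exclusion population from all patients.
non_chronic_kidney_failure_population = all_patients - exclusion_population

# SELECT on DRUG NAME, then PROJECT over PATIENT ID.
all_gentamycin_population = {
    patient_id
    for patient_id, drug_name in medications
    if drug_name == "GENTAMYCIN"
}

# JOIN over PATIENT ID: with one shared variable this is a set intersection.
study_population = all_gentamycin_population & non_chronic_kidney_failure_population

# MINUS: patients without chronic kidney failure who are not in the study group.
control_population = non_chronic_kidney_failure_population - study_population

print("exclusion:", sorted(exclusion_population))
print("study:", sorted(study_population))
print("control:", sorted(control_population))
```

With this invented data, patient 1 is excluded despite having received gentamycin, patient 2 forms the study population, and patients 3 and 4 form the control population.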