CONSTRUCTION AND MANAGEMENT OF A MEDICAL RESEARCH DATABASE USING RAQL AND SAS

Ted G. Van Rossum
Jacob V. Aranda
McGill University - Montreal Children's Hospital Research Institute


INTRODUCTION

The Developmental Pharmacology and Perinatal Research Unit (DPPRU) at the Montreal Children's Hospital, supported by a grant from Health and Welfare Canada, undertook an intensive prospective study on the epidemiology of drug utilization and adverse drug reactions in the newborn infant. The DPPRU monitored 1200 babies in the neonatal intensive care unit over a 5 year period, recording patient and mother history, medications given, lab tests taken, feedings, intravenous solutions, and physical examinations. The data, totaling over 200,000 records stored in 5 large SAS files, was designed to be an ongoing source of information for the determination of incidence, types, patterns and factors influencing drug utilization and adverse drug reactions. The statistical analysis focused on the calculation of crude incidence rates and relative risk factors for toxicity to the newborn's sensitive organs when certain drugs were given.

This type of analysis required that, for each drug under study, the survey population be divided into 3 sub-populations: study, control, and exclusion. Since the criteria defining each sub-population were different for each drug under study, it was recognized that a powerful and flexible method for sub-population retrieval was essential.

Given these data management requirements, it was decided that a relational database system, because of its flexibility, simplicity and power, would be optimal. Since SAS was chosen to handle the statistical analysis, RAQL, a new relational query language embedded in SAS, was chosen to perform the sub-population retrieval. Together, SAS and RAQL provided a comprehensive relational database management system.

This paper, using examples from our study, will cover three important topics. The first topic is a discussion in general terms of the power of the relational approach to data management. The second topic is the steps and issues involved in creating a relational database. In our study, for example, we converted the original data, stored in 5 large unwieldy files, into 3/ SAS datasets to produce a simple relational database. The third topic is the use of RAQL for the management of the relational database. Many of the advantages of using RAQL, including increased accuracy and productivity, are illustrated with examples from our research. It is estimated that an order of magnitude productivity increase was achieved in our study through the use of RAQL and SAS.


ADVANTAGES OF THE RELATIONAL APPROACH TO DATA MANAGEMENT

There are three main approaches to database design: hierarchical, network, and relational. In a research environment where flexibility of data retrieval is the main consideration, only the relational approach ensures that interrelationships between any two variables in the database can be easily assessed. The simplicity of the relational data model, a set of relations (i.e. SAS datasets), also makes the relational approach very appealing. No complex set of pointers is required to maintain the interrelationships between variables, as is the case with the other two approaches.

The most attractive feature of the relational approach is that it enables the user to interact with his data at a much higher level of abstraction through the use of the relational operators. These operators, based on mathematical set theory, operate on whole relations at a time, producing a new relation as a result. This is analogous to the higher level of abstraction obtained when using a high level programming language rather than an assembler language to perform mathematical operations. That is, the user can think in mathematical terms and not be concerned with the tedious mechanics used to implement them. This high level approach results in a clear, concise and systematic methodology for data retrieval.


CREATING A RELATIONAL DATABASE

Commercial database systems generally operate online, in real time, and in a multi-user environment where updates, additions and deletions are the order of the day. This type of environment makes commercial databases very complex to design and maintain. Research databases, on the other hand, are easy to design because, firstly, real time operation is not needed and, secondly, research data is generally static (i.e. no updates are required once the database is properly constructed). The elimination of these two performance characteristics enables the designer to focus on the prime design criterion of a research database, maximum data accessibility. The steps involved in achieving data accessibility are as follows:

- grouping the data by meaning into relations,
- selection of keys,
- normalization of the relation (data model simplification),
- and finally, permanent sort orders.


STEP 1 : GROUP THE DATA BY MEANING INTO RELATIONS

To convert the raw data into a relational database, the variables must first be grouped into a number of relations (a relation can be thought of as a SAS dataset with no duplicate observations). The prime goal in grouping variables is the creation of a set of relations, each of which has a certain meaning. That is, the aggregation of variables in a relation should aptly describe a single entity, event, concept or function.

To produce a database which is easy to understand and use, relations should have the simplest possible meaning. These elementary relations can then be used as building blocks to be manipulated by relational operators to form other relations describing more complex entities.

Figure 1 illustrates the results of the first transformation of the original study variables into 4 relations with simple meaning. It should be emphasized that this transformation is only the first step in obtaining relations with the simplest meaning possible. Step 3, normalization, further pursues this matter.


STEP 2 : SELECTION OF KEY VARIABLES

One or more variables, known as the key of the relation, must be selected which will uniquely identify each observation in the relation. Selecting a unique key may not always be obvious. For example, in REL2.0 in figure 1, all biochemistry tests for the same patient on the same day could have the same results, thereby producing a duplicate observation. Since relations should not have duplicate observations, the SEQUENCE_NUMBER variable was added to the relation to ensure uniqueness.

The selection of key variables has an important role to play in determining the flexibility with which data is accessed. To ensure that the user can access any combination of variables in the database, he must be able to join any two relations in the database using their keys, either directly or through a number of intermediate joins.


STEP 3 : NORMALIZATION

Normalization is the process whereby excess variables are removed from a relation in order to simplify the data and thus produce a better representation of the real world. Five levels of normalization can be pursued; however, for the realm of static databases (i.e. no updates, additions or deletions are made to the database), only the first level of normalization is essential since it ensures basic data accessibility. The second and third levels of normalization, though not essential, are used to further refine the database, resulting in a simpler data model and increased data accessibility. The fourth and fifth levels have little bearing on static databases and so are not discussed here.

BASIC DATA ACCESSIBILITY : FIRST NORMAL FORM

A relation is said to be in first normal form if all of its variables are simple. That is, each variable is a single item and not another relation or group of variables. For SAS datasets, this condition is basically satisfied.

A more subtle form of non-simple variable to be guarded against is the repeating field. A repeating field is defined as a number of variables with identical meaning which, when taken together, form a group. REL1.0 in figure 1 includes an example of a repeating field: the 3 identical variables MEDICATION_DURING_LABOUR. These variables taken together form the group of all the medications given during labour. It was necessary, therefore, to place this group into a separate relation, REL1.1 in figure 2.

DATA MODEL REFINEMENT AND INCREASED ACCESSIBILITY : SECOND AND THIRD NORMAL FORMS

The next step is to ensure that the relation is in second normal form. That is, ensure every non-key variable is functionally dependent on the full key. The term functionally dependent is defined as follows. Given two variables PATIENT_ID and BIRTHDATE, for example, we can say that BIRTHDATE is functionally dependent on PATIENT_ID if and only if for each value of PATIENT_ID there is associated with it only one value of BIRTHDATE. More simply, a PATIENT_ID can have only one corresponding BIRTHDATE. A patient's ADMISSION_DATE, on the other hand, is not functionally dependent on PATIENT_ID. A patient can be re-admitted with the same PATIENT_ID, therefore resulting in more than one ADMISSION_DATE associated with the PATIENT_ID.

Based on the above example, REL1.2 in figure 2 (data recorded on admission) has a 2 variable key, PATIENT_ID and ADMISSION_DATE. Of the remaining variables in this relation, only DISCHARGE_DATE is dependent on the two variables of the key. REL1.2 was therefore split into two relations as illustrated in figure 3. REL1.2.1 contains the admission and discharge data, while REL1.2.2 contains the patient biographical data. It should be noted that this transformation has also made the meaning of the relations simpler.

A relation is in third normal form if, ignoring the key variables, none of the remaining variables are functionally dependent on each other. For example, in REL1.2.2 in figure 3, BLOOD_TYPE_MOTHER and BIRTHDATE_MOTHER are both dependent on NAME_MOTHER. A simpler data model is obtained if this relationship is isolated and placed in a relation of its own. Figure 4 illustrates the new relations: REL1.2.2.1, patient biographical data, and REL1.2.2.2, mother's biographical data.


STEP 4 : PERMANENT SORT ORDER

Most implementations of relational operations, including RAQL, require that copies of the operand relations be sorted over the variables named in the operation. As a practical matter, the database designer can reduce the high cost of sorting by choosing a permanent sort order for each relation, based on the variables most often named in a relational operation. In our database, the permanent sort orders were the full key in each relation.


MANAGING A RELATIONAL DATABASE WITH RAQL

Having processed the data into a simple relational database format, the SAS user is now ready to systematically manage data retrieval tasks using RAQL, a new high level relational query language. RAQL statements placed in a SAS program provide the user with the full set of relational operators, thus enabling both data retrieval and analysis in the same program. This synergy of RAQL and SAS provides the user with a concise and systematic methodology for data retrieval.

The following section will illustrate the power of RAQL relational operators through some examples of data retrieval problems experienced at the DPPRU. Though relational operations can be programmed directly in SAS, it was estimated that in our study an order of magnitude reduction in program design, coding and testing was achieved through the use of RAQL for sub-population retrieval.


RAQL EXAMPLES

The following illustrates the use of 4 RAQL operators: SELECT, PROJECT, JOIN, and MINUS. A detailed description of all the relational operators can be found in AN INTRODUCTION TO DATABASE SYSTEMS and in the RAQL USER MANUAL (see REFERENCES).

The SELECT operator creates a new result relation by keeping only those observations in the original operand relation which meet the user specified conditions (this is similar to the SAS subsetting IF). For example, to create a new relation with the biographical data of only male babies, the following RAQL statement would be used.

    MALE_BABIES = SELECT REL1.2.2.1 WHERE SEX = 'MALE'.

The PROJECT operator creates a new relation from the original by keeping only those variables of the operand relation given in the variable list. An important feature of the PROJECT operation is that duplicate observations in the new relation are eliminated. For example, to retrieve a list of all the different drugs recorded in the survey, the following RAQL statement would be used.

    ALL_DIFFERENT_DRUGS = PROJECT REL3.0 OVER DRUG_NAME.

The concept of a JOIN is close to that of a SAS MERGE; a new relation is created by combining two operand relations. The variables of the result relation are a combination of the variables from the two operand relations (same as a SAS MERGE). An observation for the result relation is generated in each instance where the values of common variables are equal. Observations where common values cannot be matched are not included. The JOIN is more general than the SAS MERGE since it properly handles the case where there are duplicate values of the BY variable in both data sets.

An example of the use of the JOIN is as follows: to correlate MEDICATION_DURING_LABOUR from REL1.1 in figure 2 and patient DIAGNOSIS from REL4.0 in figure 1, a new relation containing both these variables must first be created. Since these 2 relations have the variable PATIENT_ID in common, the following RAQL statement can be used to produce the required relation.

    DIAGNOSIS_AND_MEDICATION = JOIN REL4.0 AND REL1.1 OVER PATIENT_ID.

The MINUS operator is like the JOIN in that it uses 2 operand relations to produce a result relation. The variables in the result relation are those of the first operand relation only. The MINUS operator produces a new relation by "subtracting" or eliminating from the first operand relation all observations having variable values in common with the second operand relation.

For example, to create a relation with the biographical data of all patients who did not receive any medication while in hospital, the following RAQL statement can be used. This operation, in effect, subtracts the "medications given" relation from the "biographical data" relation, using PATIENT_ID as the common variable.

    NO_MEDS_RECEIVED = REL1.2.2.1 MINUS REL3.0 OVER PATIENT_ID.


EXAMPLE DATABASE QUERY

Most queries to the database involved many steps, the first and most important of which is the definition of sub-populations. A sub-population is defined by creating a relation which contains a list of all the patients fitting the sub-population description. The following example illustrates a typical sub-population definition step.

To test the hypothesis that gentamycin produces kidney failure in the newborn, the following sub-populations had to be defined.

a) EXCLUSION POPULATION: those patients who had chronic kidney failure, and should therefore be excluded from the analysis.

b) STUDY POPULATION: those patients who did not have chronic kidney failure and were given the drug gentamycin.

c) CONTROL POPULATION: those patients who did not have chronic kidney failure and did not receive gentamycin.

Figure 5 illustrates a RAQL program to define the above populations. It should be noted that the RAQL program to define these sub-populations required only 8 statements; a SAS program to perform the same task would require from 4 to 10 times more statements.


CONCLUSION

Simplification of the data into the relational database format and the power of the relational operators combine to provide the user with a powerful set of tools. In the case of the DPPRU at the Montreal Children's Hospital, we found a great increase in productivity and, more importantly, a concise methodology which allowed us to easily retrieve the various sub-populations required.


ACKNOWLEDGEMENTS

The authors would like to thank Michael Gilman for the invaluable assistance in preparing this paper.


AUTHOR CONTACT

Ted Van Rossum
Developmental Pharmacology and Perinatal Research Unit, Rm. A-604
Montreal Children's Hospital
2300 Tupper St., Montreal, QUE.
Canada H3H 1P3


REFERENCES

1. Bragg, A. W. "Nonprocedural Query Facility For The Casual SAS User", SUGI Conference Proceedings, 1981.

2. Burrage, D. and Gilman, M. "RAQL - An Evolution in SAS Data Management", SUGI Conference Proceedings, 1983.

3. Cardenas, A. F. Database Management Systems, Allyn and Bacon Inc., Boston 1979.

4. Date, C. J. An Introduction To Database Systems, 3rd edition, Addison-Wesley, 1981.

5. Gilman, M. and Burrage, D. RAQL User Manual, McGill University, Montreal, Canada 1982.

6. Merrett, T. H., A Relational Information System, Reston, 1983.


FIGURE 1
EXAMPLES OF RELATIONS GROUPED BY MEANING

RELATION   REL1.0
MEANING    Data recorded on each admission to the neonatal intensive care unit.
VARIABLES  * PATIENT_ID, * ADMISSION_DATE, DISCHARGE_DATE, NAME_BABY,
           BIRTHDATE_BABY, SEX, GESTATION_AGE, BIRTH_WEIGHT,
           MEDICATION_DURING_LABOUR #1, MEDICATION_DURING_LABOUR #2,
           MEDICATION_DURING_LABOUR #3, BLOOD_TYPE_BABY, NAME_MOTHER,
           BIRTHDATE_MOTHER, BLOOD_TYPE_MOTHER

RELATION   REL2.0
MEANING    Biochemistry blood test result.
VARIABLES  * PATIENT_ID, * DATE, * SEQUENCE_NUMBER, BUN, CREATINE

RELATION   REL3.0
MEANING    Medications given.
VARIABLES  * PATIENT_ID, * DRUG_NAME, * START_DATE, * STOP_DATE, DOSE

RELATION   REL4.0
MEANING    Patient diagnoses.
VARIABLES  * PATIENT_ID, * DATE, * DIAGNOSIS

NOTE       Asterisk denotes key variables.


FIGURE 2

RELATION   REL1.1
MEANING    Medications given during labour.
VARIABLES  * PATIENT_ID, * MEDICATION_DURING_LABOUR

RELATION   REL1.2
MEANING    Data recorded on each admission to the neonatal intensive care unit.
VARIABLES  * PATIENT_ID, * ADMISSION_DATE, DISCHARGE_DATE, NAME_BABY,
           BIRTHDATE_BABY, SEX, GESTATION_AGE, BIRTH_WEIGHT,
           BLOOD_TYPE_BABY, NAME_MOTHER, BLOOD_TYPE_MOTHER,
           BIRTHDATE_MOTHER

NOTE       Asterisk denotes key variables.


FIGURE 3
TRANSFORM REL1.2 INTO SECOND NORMAL FORM

RELATION   REL1.2.1
MEANING    Patient stay.
VARIABLES  * PATIENT_ID, * ADMISSION_DATE, DISCHARGE_DATE

RELATION   REL1.2.2
MEANING    Patient's biographical data.
VARIABLES  * PATIENT_ID, NAME_BABY, BIRTHDATE_BABY, SEX, GESTATION_AGE,
           BIRTH_WEIGHT, BLOOD_TYPE_BABY, NAME_MOTHER, BLOOD_TYPE_MOTHER,
           BIRTHDATE_MOTHER

NOTE       Asterisk denotes key variables.


FIGURE 4
TRANSFORM REL1.2.2 INTO THIRD NORMAL FORM

RELATION   REL1.2.2.1
MEANING    Patient's biographical data.
VARIABLES  * PATIENT_ID, NAME_BABY, BIRTHDATE_BABY, SEX, GESTATION_AGE,
           BIRTH_WEIGHT, BLOOD_TYPE_BABY, NAME_MOTHER

RELATION   REL1.2.2.2
MEANING    Mother's biographical data.
VARIABLES  * NAME_MOTHER, BLOOD_TYPE_MOTHER, BIRTHDATE_MOTHER

NOTE       Asterisk denotes key variables.


FIGURE 5
RAQL EXAMPLE, CREATE STUDY AND CONTROL POPULATIONS

    * ALL COMMENTS START WITH AN ASTERISK

    * CREATE EXCLUSION POPULATION (I.E.
    * PATIENTS WITH CHRONIC KIDNEY FAILURE)

    CHRONIC_KIDNEY_FAILURE_PATIENTS =
        SELECT REL4.0
        WHERE DIAGNOSIS = 'CHRONIC KIDNEY FAILURE'.

    EXCLUSION_POPULATION =
        PROJECT CHRONIC_KIDNEY_FAILURE_PATIENTS
        OVER PATIENT_ID.

    * CREATE THE COMBINED STUDY PLUS CONTROL
    * POPULATIONS BY ELIMINATING THE EXCLUSION
    * POPULATION FROM FURTHER CONSIDERATION
    * NOTE: REL1.2.2 IS BIOGRAPHICAL DATA
    *       FOR EACH PATIENT

    NON_CHRONIC_KIDNEY_FAILURE_PATIENTS =
        REL1.2.2 MINUS EXCLUSION_POPULATION
        OVER PATIENT_ID.

    NON_CHRONIC_KIDNEY_FAILURE_POPULATION =
        PROJECT NON_CHRONIC_KIDNEY_FAILURE_PATIENTS
        OVER PATIENT_ID.

    * CREATE POPULATION OF PATIENTS WHO
    * RECEIVED GENTAMYCIN

    PATIENTS_WHO_RECEIVED_GENTAMYCIN =
        SELECT REL3.0
        WHERE DRUG_NAME = 'GENTAMYCIN'.

    ALL_GENTAMYCIN_POPULATION =
        PROJECT PATIENTS_WHO_RECEIVED_GENTAMYCIN
        OVER PATIENT_ID.

    * CREATE STUDY POPULATION

    STUDY_POPULATION =
        JOIN ALL_GENTAMYCIN_POPULATION
        AND NON_CHRONIC_KIDNEY_FAILURE_POPULATION
        OVER PATIENT_ID.

    * CREATE CONTROL POPULATION

    CONTROL_POPULATION =
        NON_CHRONIC_KIDNEY_FAILURE_POPULATION
        MINUS STUDY_POPULATION
        OVER PATIENT_ID.
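
For readers without access to RAQL, the sub-population logic of Figure 5 can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the RAQL implementation: relations are modelled as lists of dictionaries, the helper functions select, project, join and minus are hypothetical stand-ins for the four RAQL operators as described in the text, and the sample observations are invented purely for illustration.

```python
# Hypothetical stand-ins for the RAQL operators; a relation is a list of
# dicts, one dict per observation. None of this is actual RAQL or SAS code.

def select(rel, condition):
    """SELECT: keep the observations that satisfy the condition."""
    return [row for row in rel if condition(row)]

def project(rel, over):
    """PROJECT: keep only the variables in `over`, eliminating duplicates."""
    seen, result = set(), []
    for row in rel:
        key = tuple(row[v] for v in over)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(over, key)))
    return result

def join(left, right, over):
    """JOIN: one result observation per pair whose `over` values are equal."""
    return [{**r, **l} for l in left for r in right
            if all(l[v] == r[v] for v in over)]

def minus(left, right, over):
    """MINUS: drop from `left` every observation matching `right` on `over`."""
    matched = {tuple(r[v] for v in over) for r in right}
    return [l for l in left if tuple(l[v] for v in over) not in matched]

# Invented sample data (assumption): five patients, illustrative values only.
rel1_2_2 = [{"PATIENT_ID": i, "SEX": s}
            for i, s in [(1, "MALE"), (2, "FEMALE"), (3, "MALE"),
                         (4, "FEMALE"), (5, "MALE")]]
rel3_0 = [{"PATIENT_ID": 1, "DRUG_NAME": "GENTAMYCIN", "START_DATE": 1},
          {"PATIENT_ID": 2, "DRUG_NAME": "AMPICILLIN", "START_DATE": 1},
          {"PATIENT_ID": 3, "DRUG_NAME": "GENTAMYCIN", "START_DATE": 1},
          {"PATIENT_ID": 3, "DRUG_NAME": "GENTAMYCIN", "START_DATE": 9},
          {"PATIENT_ID": 4, "DRUG_NAME": "GENTAMYCIN", "START_DATE": 2}]
rel4_0 = [{"PATIENT_ID": 3, "DATE": 5, "DIAGNOSIS": "CHRONIC KIDNEY FAILURE"},
          {"PATIENT_ID": 5, "DATE": 6, "DIAGNOSIS": "JAUNDICE"}]

KEY = ["PATIENT_ID"]

# The eight steps of Figure 5, in the same order.
ckf_patients = select(rel4_0,
                      lambda r: r["DIAGNOSIS"] == "CHRONIC KIDNEY FAILURE")
exclusion_population = project(ckf_patients, KEY)
non_ckf_patients = minus(rel1_2_2, exclusion_population, KEY)
non_ckf_population = project(non_ckf_patients, KEY)
gentamycin_patients = select(rel3_0,
                             lambda r: r["DRUG_NAME"] == "GENTAMYCIN")
all_gentamycin_population = project(gentamycin_patients, KEY)
study_population = join(all_gentamycin_population, non_ckf_population, KEY)
control_population = minus(non_ckf_population, study_population, KEY)

print([r["PATIENT_ID"] for r in study_population])
print([r["PATIENT_ID"] for r in control_population])
```

With the invented data above, patient 3 falls into the exclusion population, patients 1 and 4 form the study population, and patients 2 and 5 form the control population. The nested-loop join is chosen for clarity rather than speed; a sorted or hashed implementation, in the spirit of the permanent sort orders discussed in Step 4, would be preferable for relations of realistic size.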