the ROCHEFORT project
A General Interface between Relational Databases and Command File Driven Statistical Packages

Leo K.J. van Romunde 1), Tom P.H. Troquay 1), Rien A. Hilhorst 2)
1) Erasmus University Rotterdam, Department of Epidemiology, Postbox 1738, NL 3000 DR Rotterdam, The Netherlands.
2) Ministry of Agriculture and Fisheries of the Netherlands, Organization and Efficiency Department, Postbox 20401, NL 2500 EK The Hague, The Netherlands.

Summary

ROCHEFORT is a research project that aims to develop an integrated and intelligent information system with the following characteristics:
1] a general interface between relational database management systems (RDBMSs) and statistical packages,
2] exploiting the strengths of both by letting the RDBMS manipulate the data and the statistical package perform the statistical analyses,
3] with features for transformations, calculations and aggregation of data,
4] easy automatic syntax generation (statistical command file and job control language),
5] offering a unified user front end, and
6] storing all relevant information and knowledge on data structures in a database.

This concept will offer expert system capabilities in the future. The first prototype is developed by means of the Structured Query Language (SQL) based RDBMS ORACLE and will interface to the SAS package. Initial implementation has been on a Digital VAX under VMS, but portability is foreseen.

Introduction

Relational Database Management Systems (RDBMSs) are becoming more and more important in data processing environments. RDBMSs will probably replace most file handling in the next two decades, a process that has been called the silent revolution (1). Statisticians cannot ignore this revolution: in order to cope with future statistical data processing they have to learn the basic principles of RDBMSs and of SQL (Structured Query Language), the language implemented in the better relational software products.
The SQL language, in which sets of records can be selected or updated in one statement, is a fourth generation language, in contrast to third generation languages such as PL/I, Fortran and C, where extensive programs have to be written to deal with sets of records and record linkage. However, the statistician also has to cope with the limitations (constraints) of RDBMSs, which are implemented to guarantee data integrity at the moment of data entry or data correction. Therefore, more data manipulation is needed before analysis than with conventional data storage methods.

RDBMSs contain not only data but also information about the data, the so-called metadata. In addition, RDBMSs can be loaded with information about statistical packages and procedures; even the syntax of statistical packages and procedures can be stored in the database. That information can be used to select, menu driven, the appropriate package and procedure, and to generate the command and data file for subsequent analysis.

The problems of using RDBMSs for production and research were inventoried two years ago by a group of researchers from the Erasmus University and, concurrently, by a group from the Ministry of Agriculture and Fisheries. The database company Oracle brought both groups together and a task force was formed to produce a general interface between RDBMSs and statistical software. In this article we will first explain some aspects of relational databases in contrast to file systems and other database systems. Subsequently we will explain the functions of the interface, and finally we will discuss the product in relation to SAS.

File systems and databases.

All SAS users are accustomed to using files or datasets.
A conceptual difference between files and datasets is only made in IBM manuals, where a file is internal to the program, a dataset is a name for the physical storage of data and programs on tape or disk, and the Job Control Language is used to connect files to datasets. In most other operating systems, however, datasets are accessed directly from the program and the difference between files and datasets is meaningless. In this article we will use the term file instead of dataset.

Files can be categorized into fixed length files and variable length files (see figure 1). Most statisticians did their first computer analyses on fixed files, which have a fixed number of variables and a limited number of observations. Variable length files are often used for repeated measurements, for instance repeated blood pressure measurements (figure 2).

SEQUENTIAL FIXED
0001 M 21-10-48
0002 F 12-05-52
0003 F 07-09-30
Figure 1

SEQUENTIAL VARIABLE
0001 M 21-10-48 12-12-75 090/130 12-2-78 095/140
0002 F 12-05-52 17-12-75 080/132
0003 F 07-09-30 19-12-75 085/140 19-2-78 110/160
Figure 2

Files can also be categorized according to how they can be addressed. Most files are sequential. Figure 3 first shows a disk with a sequential file: the information is stored as a sequence, and reading and writing always start at the beginning of the file. This way of storing information can be inefficient for data entry or data correction. Therefore the file can be subdivided into regions (figure 3: regional), so that reading and writing can start at a region boundary. Another way to optimize information retrieval is to work like a book, with sequential information and an index: searches are first performed on the index and subsequently on the data (figure 3: indexed sequential). This is the indexed sequential method.
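The difference between a pure sequential scan and an indexed lookup can be sketched in a few lines of code. The records mirror figure 1; the function names are illustrative only, not part of any system described here.

```python
# Records of the fixed-length file of Figure 1: (id, gender, birthdate).
records = [
    ("0001", "M", "21-10-48"),
    ("0002", "F", "12-05-52"),
    ("0003", "F", "07-09-30"),
]

def sequential_lookup(key):
    # Sequential organization: scan from the beginning of the file
    # until the wanted record is found.
    for rec in records:
        if rec[0] == key:
            return rec
    return None

# Indexed sequential organization: a separate index maps keys to
# positions, so a search probes the index first and then jumps
# directly to the data.
index = {rec[0]: pos for pos, rec in enumerate(records)}

def indexed_lookup(key):
    pos = index.get(key)
    return records[pos] if pos is not None else None

print(sequential_lookup("0003"))  # ('0003', 'F', '07-09-30')
print(indexed_lookup("0003"))     # same record, one index probe
```

Both functions return the same record; the indexed version simply avoids reading the records that precede it.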
Figure 3: sequential, regional and indexed sequential file organization.

A lot of technical and organizational knowledge was needed to build information systems for factories, financial institutions, hospitals etc. with these ingredients. Program and data maintenance problems were already great in the late sixties, when computers were introduced for administrative work. A new term was introduced, namely "databases". Databases should be less hardware dependent, and the same information should not be stored twice. In addition, the metadata (information about the data) should be stored in the database as well. Four types of databases have been produced since that time.

The first type was the hierarchical database, where information was structured like a tree (figure 4). This database type functioned quite well when information was retrieved according to the tree. In figure 4 we see a hospital information tree: the basic root is the hospital, smaller roots are the patients, the smallest roots are the visits, and the leaves are the examinations. Information about the examination on uric acid on 20-may-83 of patient X can be found quickly, but collecting information about high uric acid and renal stones is complicated and requires a sequential search through all leaves. In addition, not all information can be structured into trees. Even a genealogy tree, which looks rather tree-like, is not a real tree and cannot be structured according to a hierarchical model.

Figure 4: hierarchical database (hospital - patients - visits - examinations).

The constraint of the tree structure was relaxed by the introduction of network databases. Network databases provided facilities to interconnect information without a tree structure. However, the interconnections had to be known in advance; they formed the network between the basic information units (entities) (figure 5). Conceptual, not technical, rules were provided to form the entities from the data. These rules, the normalization rules, should help to maintain data integrity.
As mentioned before, data should not be stored twice, because of possible correction problems, where one version could be corrected and the other not. The fact that interconnections had to be known in advance was problematic in several circumstances, especially where production data also had to be used for management information systems or research systems. Queries for management and research are often market dependent and therefore unpredictable. This led to the construction and introduction of the relational database type.

Figure 5: network database (patients - visits - therapy).

Relational databases have only entities (or objects), and the connections between these objects are established at the moment of the query (figure 6). This makes relational databases very flexible, at some expense of speed. In fact one can forget about the physical storage, about trees and even about all possible interconnections. The important things that remain are mainly conceptual, namely the construction of objects, the normal forms, and the constraints for data integrity and relational integrity (integrity between objects). Information in different objects can easily be joined using the already mentioned Structured Query Language (SQL).

Figure 6: relational database (patients, visits, examinations, therapy; no permanent connections).

The newest development in this field is the semantic database, which most closely resembles the relational one. In this model some connections are established in advance, but not all. The conceptual difference between "consists of" and "has" is explicitly expressed in this model. The model is claimed to be faster than the relational one, but no commercial products are available at this moment.

After this summary concerning file systems and databases, we will explain some more details about relational databases. One of the unpleasant habits in relational database theory is that everything got a different name than it had before.
To start with the name "relation": it does not refer to the relation between entities; the term is used in the mathematical sense. A mathematical relation is something like a table (figure 7). It has a heading with variable names, which are called attributes in relational database terminology. The rows in the table are called tuples instead of rows or records, and a specific value for a variable in a row is called an attribute value. Most articles about relational databases use the terms "table" and "relation" to indicate the same thing; the terms "variable" and "record" are mostly avoided.

RDBMS            (fixed format files)
Tables           (files)
Tuples           (records)
Attributes       (variables)
Attribute values (data)

PARENTS
Family Member Gender Birthdate
003    01     male   21-10-47
003    02     female 20-09-50
007    01     male   07-08-49
007    02     female 09-09-52
Figure 7

The information about the data, the metadata, is also stored in tables (figure 8): the data dictionary.

COLS (= metadata)
Table-name Col-name  Type   Format
Parents    Family    Number F5.0
Parents    Member    Number F2.0
Parents    Gender    Char   A10
Parents    Birthdate Date   DD-MM-YY
Figure 8

The Structured Query Language provides an extensive number of facilities to create new tables (data definition), to insert tuples into tables, to update table content, to select information from one or more tables and to recode information (data manipulation). SQL is therefore a data definition language as well as a data manipulation language. Information is selected with the SQL SELECT statement, which basically consists of three parts: the select part, the from part and the where part:

SELECT var1, var2, .. varn
FROM table_names
WHERE condition

Figure 9 shows a simple query and figure 10 shows a query in which two tables are combined.

SELECT birthdate
FROM   parents
WHERE  gender = 'FEMALE'
Figure 9

SELECT parents.weight, children.weight, children.age
FROM   parents, children
WHERE  parents.family = children.family

PARENTS                    CHILDREN
Family Member Weight       Family Member Age Weight
003    01     70           003    03     01  10
003    02     59

RESULT:
Par. weight Child. weight Child. age
70          10            01
59          10            01
Figure 10

A strong feature of SQL is the view, which is a virtual table. In fact it is a select statement, but to the user it looks like a table. Views can be created from more than one original table, and a view is automatically up to date when the underlying tables are updated. Figure 11 shows a view based on two tables, namely BASELINE_INFO and LABORATORY: the birthdate of BASELINE_INFO is combined with the investigation_date of LABORATORY to compute the age. There are now many popular books in many languages which introduce the novice to the practical application of SQL. The number of books concerning relational database design is limited, and they are mostly in English (2).

Figure 11: a view, dynamically updated, based on the tables BASELINE_INFO and LABORATORY.

The Interface.

As mentioned in the introduction, an inventory has been made of problems which could arise when relational databases are used for research and management, especially in medicine and agriculture. Some of these points were directly related to the relational model; other points were more a wishlist of the participants. In addition, some statistical packages have a large number of data manipulation facilities (e.g. SAS), but other packages require that all data manipulation has been performed in advance, such as the stand alone versions of HOMALS, PRINCALS and CANALS before they were incorporated into SAS. Most features for recoding and reshaping data, as published in the SAS applications guide, presuppose sequential aspects of the data file. These sequential aspects are absent in RDBMSs: there is no "first observation" if this information is not explicitly stored. Therefore we decided to move all these recoding and reshaping activities to the database.
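The queries of figures 9 and 10, and the view mechanism, can be tried out with any SQL system. The sketch below uses Python's sqlite3 module with the tables and data of the figures; the view name family_weights is our own illustration, not part of the paper's example.

```python
import sqlite3

# In-memory database holding the PARENTS and CHILDREN tables of
# figures 7 and 10 (the weight column appears only in figure 10).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parents  (family TEXT, member TEXT, gender TEXT,
                           birthdate TEXT, weight INTEGER);
    CREATE TABLE children (family TEXT, member TEXT, age INTEGER,
                           weight INTEGER);
    INSERT INTO parents  VALUES ('003','01','MALE'  ,'21-10-47',70),
                                ('003','02','FEMALE','20-09-50',59);
    INSERT INTO children VALUES ('003','03',1,10);
""")

# Figure 9: a simple query.
print(con.execute(
    "SELECT birthdate FROM parents WHERE gender = 'FEMALE'").fetchall())
# [('20-09-50',)]

# Figure 10: two tables combined on the family key.
print(con.execute("""
    SELECT parents.weight, children.weight, children.age
    FROM parents, children
    WHERE parents.family = children.family""").fetchall())
# [(70, 10, 1), (59, 10, 1)]

# A view is a stored SELECT that behaves like a table and always
# reflects the current content of the underlying tables.
con.execute("""
    CREATE VIEW family_weights AS
    SELECT parents.weight AS par_weight, children.weight AS child_weight
    FROM parents, children
    WHERE parents.family = children.family""")
print(con.execute("SELECT * FROM family_weights").fetchall())
```

Inserting another child afterwards immediately changes the result of querying family_weights, without any explicit refresh: the "dynamically updated" property of figure 11.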
Most of these activities could be performed directly with SQL, but some other problems, such as the clinical visit problem (SAS applications guide, page 9), needed a special approach, which we called denormalization.

The RDBMS contains more information than only data. It contains attribute descriptions, such as "numeric" or "character" and "field length". In addition, information about value labels and variable labels is often present in the database (SAS uses the term "format" for value labels). These metadata should be routed to the statistical package together with the data.

One of the major points of the wishlist was statistical syntax generation. Novice researchers need about two months to learn to work with statistical packages, and more time if more than one package is involved. Furthermore, much time is spent correcting syntax errors, such as omitted periods in BMD-P and omitted semicolons in SAS. The introduction of user friendly statistical packages on personal computers prompted a demand for an equally user friendly interface on super minis and mainframes.

A specific problem related to databases is their dynamic aspect. In a hospital, data entry cannot be stopped because some researchers would like to perform analyses. However, if data entry is not stopped, all frequency tables will have different totals, which is often not accepted by article referees. Therefore a snapshot facility was needed.

This summary of problems and wishes is not complete, but the most important ones have been mentioned here. After the inventory of problems and wishes, a data model could be made, and a prototype interface has subsequently been constructed. The specific and interesting construction problems will be published elsewhere. In this article we will concentrate on the result, being the function of the prototype interface. The prototype interface is menu driven.
There are four main options (figure 12), namely 1] loading and updating of the research base, 2] data manipulation, 3] selection of a statistical routine, and 4] loading and updating of statistical syntax and generation of screens. The interface is built on the RDBMS ORACLE.

MENU
- Update research base
- Data manipulation
- Statistics
- Update syntax base
Figure 12

The research base is loaded with a snapshot from a dynamically changing database, or from data files or system files. When the database is stable (after the last correction) it is not necessary to make a snapshot and the tables can be used directly, but this has to be confirmed explicitly (figure 13).

UPDATE RESEARCH BASE
- start new project
- drop research project
- select information from previous research
- load research base from files (data files, system files)
- make snapshot from database
Figure 13

The second option of the main menu is data manipulation (figure 14). All recoding, reshaping, merging and updating mentioned in the SAS applications guide can be performed with this option. A number of features are implemented to cope with repeated measurements recorded in multiple tuples, such as the clinical visit problem mentioned earlier (figure 15). The only data manipulation aspects which are not supported are those which are not allowed in RDBMSs, such as repeated fields.

DATA MANIPULATION
- Structured Query Language (SQL)
- Variable labels
- Value labels (formats)
- Recode
- Missing values
- View
- Aggregate (denormalize)
- Time, intervals and events (repeated measurement facility)
Figure 14

REPEATED MEASUREMENTS FACILITY

id  date      systolic        id  syst 1980  syst 1981  syst 1982
01  02-07-80  114             01  114        119        120
01  12-09-81  118       =>    02  130        -          -
01  22-12-81  120
01  23-01-82  120
02  02-07-80  130
Figure 15

The third option is the selection of the statistical procedure. Many nested menus can be built behind this option.
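The repeated measurements facility of figure 15 amounts to a long-to-wide pivot with aggregation. One common way to express this in plain SQL is conditional aggregation (an aggregate over a CASE expression); the sketch below, again using sqlite3 with the data of figure 15, stands in for the statements the interface would generate, not for its actual implementation.

```python
import sqlite3

# Long format: one tuple per measurement, as stored in the RDBMS.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE pressures (id TEXT, date TEXT, systolic INTEGER);
    INSERT INTO pressures VALUES
        ('01','02-07-80',114), ('01','12-09-81',118),
        ('01','22-12-81',120), ('01','23-01-82',120),
        ('02','02-07-80',130);
""")

# Wide format: one tuple per id, one column per year. AVG over a
# CASE expression averages the measurements falling in each year
# and yields NULL for years without a measurement.
rows = con.execute("""
    SELECT id,
           AVG(CASE WHEN date LIKE '%-80' THEN systolic END) AS syst_1980,
           AVG(CASE WHEN date LIKE '%-81' THEN systolic END) AS syst_1981,
           AVG(CASE WHEN date LIKE '%-82' THEN systolic END) AS syst_1982
    FROM pressures
    GROUP BY id
    ORDER BY id""").fetchall()
print(rows)
# [('01', 114.0, 119.0, 120.0), ('02', 130.0, None, None)]
```

The two 1981 measurements of patient 01 (118 and 120) are averaged to 119; other aggregates (first, last, maximum) would simply replace AVG.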
After filling in the specifications for the analysis, the interface selects the appropriate package and procedure, and syntax is generated accordingly. When two packages could be used, the package of preference is chosen; each researcher can specify his own preferences.

The fourth option gives the user or statistical expert the possibility to incorporate new packages and new procedures in the interface (figure 16). First, tables are generated (with SQL) for the model and methods of the procedure in general, independent of packages. Subsequently, screens can be generated automatically (FASTFORM facility) or with a form editor (SQL*FORMS). The third step is the specification of the package possibilities for the procedure; packages produce different statistics, e.g. a kappa value can be produced by the frequency procedure of one package but not by that of another. The fourth step is the specification of the syntax. After these four steps the new procedure can be used. Many checks can be incorporated in the screens to prevent the generation of wrong syntax. Building these checks is the most time consuming part, but it pays off later.

UPDATE SYNTAX BASE
- Add/modify package specifications
- Create/update tables for procedures
- Create/update screens for procedures
- Create/update package-procedure specifications and limitations
- Create/update package-procedure syntax
Figure 16

Discussion.

This prototype is intended as a general prototype, in which all SQL based relational databases can be connected to all command file driven statistical packages. In practice, however, one has to limit oneself in order to produce a prototype at all. Oracle has been chosen as RDBMS because it works on a large range of machines and because Oracle intends to adapt its tools to IBM's DB2, which is also an SQL based relational database.
Since Oracle and DB2 together cover most of the RDBMS market, we believe that this was the right approach on the RDBMS side of the interface. On the statistical side of the interface, we tried to make it as wide as possible. Many new statistical techniques are not yet implemented in SAS; how long did it take until HOMALS, CANALS and PRINCALS were incorporated? It is a question of time, finance and market. In practice most researchers use one or two packages together with some stand alone programs. Our approach provides the possibility to incorporate all these statistical facilities in one single shell.

This shell can be called an expert system, but that depends on the criteria used for the term expert system. A lot of expertise is incorporated in the shell; however, no special forward chaining or backward chaining routines are used to accomplish the task in less time.

Loading the database with statistical expertise is at the moment the most time consuming part. It depends on what researchers want: if they use only twenty different procedures, these screens and syntaxes can be tailored to their wishes. Tailored loading costs about two to six weeks, but loading all SAS procedures will take more time. This too is a question of market wishes.

The prototype has now been constructed and functions quite well. However, it is a nearly empty shell, in the sense that most of the statistical procedures have not been implemented yet. A shared effort of a group of users would reduce the cost of filling the database with statistical expertise.

References

1) ORCE Systems. The relational database management system. NGI The Netherlands / ORCE The Netherlands, 1985.
2) Perkinson RC. Data analysis, the key to database design. QED Information Sciences Inc., Wellesley MA, USA / Elsevier Science Publishers, Amsterdam. Second edition, 1985.