The ROCHEFORT project
A General Interface between Relational Databases
and
command file driven statistical packages.
Leo K.J. van Romunde 1), Tom P.H. Troquay 1), Rien A. Hilhorst 2)
1) Erasmus University Rotterdam, Department of Epidemiology,
   Postbox 1738, NL 3000 DR Rotterdam, The Netherlands.
2) Ministry of Agriculture and Fisheries of the Netherlands,
   Organization and Efficiency Department, Postbox 20401,
   NL 2500 EK The Hague, The Netherlands.
Summary
ROCHEFORT is a research project that aims to develop an
integrated and intelligent information system with the following
features:
1] a general interface between relational database management
   systems (RDBMSs) and statistical packages,
2] exploiting the strength of both by allowing
   - the RDBMS to manipulate the data
   - the statistical package to perform the statistical analyses,
3] with features for transformation, calculation and aggregation
   of data,
4] and easy automatic syntax generation (statistical command file
   and job control language),
5] offering a unified user front end,
6] storing all relevant information and knowledge on data
   structures in a database.
This concept will offer expert system capabilities in the future.
The first prototype has been developed with ORACLE, an RDBMS based
on the Structured Query Language (SQL), and will interface to the
SAS package. The initial implementation was on a Digital VAX
under VMS, but portability is foreseen.
Introduction
Relational Database Management Systems (RDBMSs) are becoming
increasingly important in data processing environments. RDBMSs
will probably replace most file handling in the next two
decades. This process is called the silent revolution (1).
Statisticians cannot neglect this revolution. In order to cope
with future statistical data processing they have to learn the basic
principles of RDBMSs and the language SQL (Structured Query
Language), which is implemented in the better relational software
products. The SQL language, in which sets of records can be selected
or updated in one statement, is a fourth generation language, in
contrast to third generation languages such as PL/I, Fortran and C,
where extensive programs have to be written in order to deal with
sets of records and record linkage.
However, the statistician also has to cope with the limitations
(constraints) of RDBMSs, which are implemented to guarantee data
integrity at the moment of data entry or data correction.
Therefore, more data manipulation is needed before analysis
than with conventional data storage methods.
RDBMSs contain not only data but also information about the
data, the so-called metadata. In addition, RDBMSs can be loaded
with information about statistical packages and procedures. Even
the syntax of statistical packages and procedures can be stored in
the database. That information can be used to select the
appropriate package and procedure through menus and to generate
the command file and data file for subsequent analysis.
The problems of using RDBMSs for production and research were
inventoried two years ago by a group of researchers from
Erasmus University and concurrently by a group from the Ministry
of Agriculture and Fisheries. The database company Oracle brought
both groups together and a task force was formed to produce
a general interface between RDBMSs and statistical software.
In this article we will first explain some aspects of relational
databases in contrast to file systems and other database systems.
Subsequently we will explain the functions of the interface and
finally we will discuss the product in relation to SAS.
File systems and databases.
All SAS users are accustomed to using files or datasets. A conceptual
difference between files and datasets is only made in IBM manuals,
where a file is internal to the program and a dataset is a name for
physical storage of data and programs on tape or disk and where
the Job Control Language is used to connect files to datasets.
However, in most other operating systems, datasets are directly
accessed from the program, and a difference between files and
datasets is meaningless in such cases. In this article we will use
the term file instead of dataset.
Files can be categorized into fixed length files and variable
length files (see figure 1). Most statisticians did their first
computer analyses on fixed length files, which hold a fixed number of
variables on a limited number of observations. Variable length files
are often used for repeated measurements, for instance repeated blood
pressure measurements (figure 2).
SEQUENTIAL FIXED
0001 M 21-10-48
0002 F 12-05-52
0003 F 07-09-30
Figure 1
SEQUENTIAL VARIABLE
0001 M 21-10-48 12-12-75 090/130 12-2-78 095/140
0002 F 12-05-52 17-12-75 080/132
0003 F 07-09-30 19-12-75 085/140 19-2-78 110/160
Figure 2
Files can also be categorized according to how they can be
addressed. Most files are sequential. Figure 3 shows first a disk
with a sequential file. The information is stored as a sequence.
Reading and writing always start at the beginning of the file. This
way of storing information can be inefficient for data entry or
data correction. Therefore the file can be subdivided into regions
(figure 3: regional) and reading and writing can be started at a
region boundary. Another way to optimize information retrieval is
to work like a book, with sequential information and an index.
Searches are first performed on the index and subsequently on the
data (figure 3: indexed sequential). This is the indexed sequential
method.
SEQUENTIAL
REGIONAL
INDEXED SEQUENTIAL
Figure 3
A lot of technical and organizational knowledge was needed to
build information systems for factories, financial institutions,
hospitals etc. with these ingredients. Program and data maintenance
problems were already great in the late sixties, when computers
were introduced for administrative work. A new term was
introduced, namely "databases". Databases should be less hardware
dependent and the same information should not be stored twice. In
addition, the metadata (information about the data) should be
stored in the database as well.
Four types of databases have been produced since that time. The
first type was the hierarchical database, where information was
structured like a tree (figure 4). This database type functioned
quite well when information was retrieved according to the tree. In
figure 4 we see a hospital information tree. The basic root is the
hospital. Smaller roots are the patients. The smallest roots are the
visits and the leaves are the examinations. Information about the
examination on uric acid on 20-may-8J of patient X can be found
quickly, but collecting information about high uric acid and renal
stones is complicated and requires a sequential search through all
leaves. In addition, not all information can be structured into trees.
Even a genealogy tree, which is rather tree-like, is not a real tree
and cannot be structured according to a hierarchical model.
Hierarchical DB
Figure 4
The constraint of the tree structure was removed by the
introduction of network databases. Network databases provided
facilities to interconnect information without a tree structure.
However, the interconnections had to be known in advance. They
formed the network between the basic information (entities) (figure
5). Conceptual, not technical, rules were provided to form the
entities from the data. The rules were called the NORMAL rules
and they should help to maintain data integrity. As mentioned
before, data should not be stored twice, because of possible
correction problems, where one version could be corrected and the
other not.
The fact that interconnections had to be known in advance was
problematic in several circumstances, especially where production
data also had to be used for management information systems or
research systems. Queries for management and research are often
market dependent and therefore unpredictable. This led to the
construction and introduction of the relational database type.
Therapy
Visits
Network DB
Figure 5
Relational databases have only entities (or objects) and the
connection between these objects is established at the moment of
the query (figure 6). This makes relational databases very flexible,
at some expense of speed. In fact one can forget about the
physical storage, about trees and even about all possible
interconnections. The important things which remain are mainly
conceptual, namely the construction of objects, the normal forms
and the constraints for data integrity and relational integrity
(integrity between objects). Information in different objects can
easily be joined using the already mentioned Structured Query
Language (SQL).
(no permanent connections)
Patients
Visits
Examinations
Therapy
Relational DB
Figure 6
The newest development in this field is the semantic
database. The semantic database most resembles the relational
one. In this model some connections are established in advance but
not all. The conceptual difference between "consists of" and "has" is
explicitly expressed in this model. The model is claimed to be
faster than the relational one. However, no commercial products are
available at this moment.
After this summary concerning file systems and databases, we
will explain some more details about relational databases. One of
the unpleasant habits in relational database theory is that everything
was given a different name from the one it had before. To start with
the name "relation": it does not mean the relation between entities;
the term is used in the mathematical sense. A mathematical relation is
something like a table (figure 7). It has a heading with variable
names, which are called attributes in relational database
terminology. The rows in the table are called tuples instead of rows
or records. A specific value for a variable in a row is called an
attribute value. Most articles about relational databases use the
terms "table" and "relation" to indicate the same thing. The terms
"variable" and "record" are mostly avoided.
RDBMS
Tables            (fixed format files)
Tuples            (records)
Attributes        (variables)
Attribute values  (data)

PARENTS
Family  Member  Gender  Birthdate
003     01      male    21-10-47
003     02      female  20-09-50
007     01      male    07-08-49
007     02      female  09-09-52
Figure 7
The information about the data, the metadata, is also stored in
tables (figure 8).
Information about the data (attribute names etc.) = metadata
Metadata is also stored in tables: data dictionary.
COLS
Table-name  Col-name   Type    Format
Parents     Family     Number  F5.0
Parents     Member     Number  F2.0
Parents     Gender     Char    A10
Parents     Birthdate  Date    DD-MM-YY
Figure 8
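The idea that metadata are stored in tables next to the data can be tried out in most SQL systems. The following is a minimal sketch, not the prototype's code: it assumes Python's built-in sqlite3 module as a stand-in for ORACLE, and it uses SQLite's PRAGMA table_info where figure 8 shows an ORACLE dictionary table. The column types and the helper name describe are illustrative assumptions.

```python
import sqlite3

# Assumption: SQLite stands in for ORACLE purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parents ("
    " family INTEGER, member INTEGER, gender TEXT, birthdate TEXT)"
)

def describe(connection, table_name):
    """Return (column name, declared type) pairs read from SQLite's catalogue.

    Hypothetical helper; table_name is trusted input in this sketch.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    rows = connection.execute(f"PRAGMA table_info({table_name})").fetchall()
    return [(row[1], row[2]) for row in rows]

columns = describe(conn, "parents")
for name, declared_type in columns:
    print(f"{name:<10} {declared_type}")
```

The point of the sketch is only that a program can ask the database what a table looks like, which is what an interface needs in order to pass variable names and types on to a statistical package.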
The Structured Query Language provides an extensive number
of facilities to create new tables (data definition), to insert tuples
into tables, to update table content, to select information from one
or more tables and to recode information (data manipulation). SQL
is therefore a data definition language as well as a data
manipulation language. Information is selected with the SQL
SELECT statement, which consists basically of three parts: the select
part, the from part and the where part.
SELECT var1, var2, ..., varn FROM table_names WHERE condition.
Figure 9 shows a simple query and figure 10 shows a query where
two tables are combined.
SELECT birthdate
FROM   parents
WHERE  gender = 'FEMALE'
Figure 9
SELECT parents.weight, children.weight, children.age
FROM   parents, children
WHERE  parents.family = children.family

PARENTS
Family  Member  Weight
003     01      70
003     02      59

CHILDREN
Family  Member  Age  Weight
003     03      01   10

RESULT:
Par.weight  Child.weight  Child.age
70          10            01
59          10            01
Figure 10
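The query of figure 10 can be reproduced with the data shown in the figure. This is a sketch under the assumption that Python's sqlite3 module is an acceptable stand-in for ORACLE; the column types and the ORDER BY clause (added only to make the row order predictable) are not part of the original figure.

```python
import sqlite3

# Assumption: SQLite stands in for ORACLE; the data are those of figure 10.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parents  (family TEXT, member TEXT, weight INTEGER);
CREATE TABLE children (family TEXT, member TEXT, age INTEGER, weight INTEGER);
INSERT INTO parents  VALUES ('003', '01', 70), ('003', '02', 59);
INSERT INTO children VALUES ('003', '03', 1, 10);
""")

# The connection between the two tables exists only in the WHERE clause.
rows = conn.execute(
    "SELECT parents.weight, children.weight, children.age "
    "FROM parents, children "
    "WHERE parents.family = children.family "
    "ORDER BY parents.member"
).fetchall()

for parent_weight, child_weight, child_age in rows:
    print(parent_weight, child_weight, child_age)
```

Each parent tuple is paired with every child tuple of the same family, so the single child appears once per parent.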
A strong feature of SQL is the view, which is a virtual table. In
fact it is a select statement, but to the user it looks like a table.
Views can be created from more than one original table. A view is
automatically updated when the underlying tables are updated.
Figure 11 shows a view which is based on two tables, namely
BASELINE_INFO and LABORATORY. The birthdate of
BASELINE_INFO is combined with the investigation_date of
LABORATORY to compute the age.
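A view of this kind can be sketched as follows. The example assumes Python's sqlite3 module in place of ORACLE, ISO-formatted dates, and an approximate age computed as elapsed days divided by 365.25; the column patient_id and the view name lab_with_age are hypothetical. The dates are those of patient 0001 in figure 2.

```python
import sqlite3

# Assumption: SQLite stands in for ORACLE; names other than birthdate and
# investigation_date are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE baseline_info (patient_id INTEGER, birthdate TEXT);
CREATE TABLE laboratory    (patient_id INTEGER, investigation_date TEXT);

CREATE VIEW lab_with_age AS
SELECT l.patient_id,
       l.investigation_date,
       -- approximate age in whole years from the difference in days
       CAST((julianday(l.investigation_date) - julianday(b.birthdate))
            / 365.25 AS INTEGER) AS age
FROM   baseline_info b, laboratory l
WHERE  b.patient_id = l.patient_id;

INSERT INTO baseline_info VALUES (1, '1948-10-21');
INSERT INTO laboratory    VALUES (1, '1975-12-12');
""")

query = "SELECT patient_id, age FROM lab_with_age ORDER BY investigation_date"
before = conn.execute(query).fetchall()

# A new laboratory tuple shows up in the view without redefining it.
conn.execute("INSERT INTO laboratory VALUES (1, '1978-02-12')")
after = conn.execute(query).fetchall()

print(before)
print(after)
```

Because the view is only a stored select statement, the second query sees the newly inserted investigation without any further action.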
There are now many popular books in many languages which
introduce the novice to the practical application of SQL. Books
concerning relational database design are few and are mostly
in English (2).
BASELINE_INFO
LABORATORY
View: dynamically updated
Figure 11
The Interface.
As has been mentioned in the introduction, an inventory has been
made of problems which could arise when relational databases are
used for research and management, especially in medicine and
agriculture. Some of these points were directly related to the
relational model and other points were more a
wishlist of the participants. In addition, some statistical packages
have a large number of data manipulation facilities (e.g. SAS), but
other packages require that all data manipulation has been
performed in advance, such as the stand alone versions of
HOMALS, PRINCALS and CANALS before they were incorporated
into SAS.
Most features for recoding and reshaping the data, as published
in the SAS applications guide, presuppose sequential aspects of the
data file. These sequential aspects are absent in RDBMSs. There is
no "first observation" if this information is not explicitly stored.
Therefore we decided to move all these recoding and reshaping
activities to the database. Most of these activities could be
performed directly with SQL, but some other problems, such as the
clinical visit problem (SAS applications guide page 9), needed a
special approach, which we called denormalization.
The RDBMS contains more information than only data. It
contains attribute descriptions, such as "numeric" or "character" and
"field length". In addition, information about value labels and
variable labels is often present in the database. SAS uses the term
"format" for value labels. These metadata should be routed to the
statistical package together with the data.
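Routing value labels to the statistical package can be pictured as turning rows of a label table into package syntax. The sketch below is an illustration only and not the prototype's method: it assumes Python's sqlite3 module in place of ORACLE, a hypothetical table value_labels, numeric codes 1 and 2 for gender, and it emits a SAS PROC FORMAT step as plain text that has not been run through SAS here.

```python
import sqlite3

# Assumption: SQLite stands in for ORACLE; value_labels and its contents
# are hypothetical examples of value labels kept in the database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE value_labels (format_name TEXT, code INTEGER, label TEXT);
INSERT INTO value_labels VALUES ('gender', 1, 'male'), ('gender', 2, 'female');
""")

def proc_format(connection, format_name):
    """Build the text of a SAS PROC FORMAT step from stored value labels.

    Labels containing quotes are not escaped in this sketch.
    """
    rows = connection.execute(
        "SELECT code, label FROM value_labels "
        "WHERE format_name = ? ORDER BY code",
        (format_name,),
    ).fetchall()
    lines = ["proc format;", f"  value {format_name}"]
    lines += [f"    {code} = '{label}'" for code, label in rows]
    lines += ["  ;", "run;"]
    return "\n".join(lines)

sas_text = proc_format(conn, "gender")
print(sas_text)
```

Keeping the labels in one table means the same rows could feed a different generator for another package, which is the kind of reuse the interface is after.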
One of the major points of the wishlist was statistical syntax
generation. Novice researchers need about two months to learn to
work with statistical packages. It takes more time if more than one
package is involved. Furthermore, much time is spent correcting
syntax errors, such as omitted points in BMD-P and omitted
semicolons in SAS. The introduction of user friendly statistical
packages on personal computers created a demand for a user friendly
interface on super minis and mainframes.
A specific problem related to databases was the dynamic aspect
of the database. In a hospital, data entry cannot be stopped
because some researchers like to perform some analyses. However,
if data entry is not stopped, all frequency tables will have different
totals, which is often not accepted by article referees. Therefore a
snapshot facility was needed.
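The effect of a snapshot can be shown with a copy of a table taken at one moment. This is a sketch under stated assumptions: Python's sqlite3 module replaces ORACLE, the snapshot is made with CREATE TABLE ... AS SELECT rather than with any facility of the prototype, and the table names and the third patient are invented for the example.

```python
import sqlite3

# Assumption: SQLite stands in for ORACLE; the tables and the extra row
# are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (patient_id INTEGER, systolic INTEGER);
INSERT INTO visits VALUES (1, 114), (2, 130);

-- Freeze the current content for the research base.
CREATE TABLE research_visits AS SELECT * FROM visits;
""")

# Data entry continues on the production table after the snapshot.
conn.execute("INSERT INTO visits VALUES (3, 125)")

live_count = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
snapshot_count = conn.execute(
    "SELECT COUNT(*) FROM research_visits"
).fetchone()[0]

print("production table:", live_count, "rows")
print("snapshot:", snapshot_count, "rows")
```

All analyses run against the frozen copy therefore report the same totals, whatever happens to the production table in the meantime.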
The summary of problems and wishes is not complete, but the
most important ones have been mentioned here. After the inventory
of problems and wishes, a datamodel could be made and a prototype
interface has subsequently been constructed. The specific and
interesting construction problems will be published elsewhere. In
this article we will concentrate on the result, being the function of
the prototype interface.
The prototype interface is menu driven. There are four main
options (figure 12), namely 1] loading and updating of the research
base, 2] data manipulation, 3] selection of statistical routine and 4]
loading and updating of statistical syntax and generation of screens.
The interface is built on the RDBMS ORACLE.
MENU
- Update research base
- Data manipulation
- Statistics
- Update syntax base
Figure 12
The research base is loaded with a snapshot from a dynamically
changing database or from data files or system files. When the
database is stable (after the last correction) there is no need to
make a snapshot and the tables can be used directly, but this has
to be explicitly confirmed (figure 13).
UPDATE RESEARCH BASE
- start new project
- drop research project
- select information from previous research
- load research base from files (data files, system files)
- make snapshot from database
Figure 13
The second option of the main menu is data manipulation
(figure 14). All recoding, reshaping, merging and updating as
mentioned in the SAS applications guide can be performed with this
option. A number of features are implemented to cope with
repeated measurements recorded on multiple tuples, such as the
clinical visit problem mentioned earlier (figure 15). The only data
manipulation aspects which are not supported are those which are
not allowed in RDBMSs, such as repeated fields.
DATA MANIPULATION
- Structured Query Language (SQL)
- Variable labels
- Value labels (formats)
- Recode
- Missing values
- View
- Aggregate (denormalize)
- Time, intervals and events
  (repeated measurement facility)
Figure 14
REPEATED MEASUREMENTS FACILITY
id  date      systolic
01  02-07-80  114
01  12-09-81  118
01  22-12-81  120
01  23-01-82  120
02  02-07-80  130

id  syst 1980  syst 1981  syst 1982
01  114        119        120
02  130
Figure 15
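The reshaping of figure 15, from one tuple per visit to one tuple per patient, can be approximated in plain SQL with conditional aggregation. The sketch below is not the prototype's repeated measurements facility: it assumes Python's sqlite3 module in place of ORACLE, ISO-formatted dates, and that the yearly value is the mean of the visits in that year (which matches the 119 shown for 1981). The table name pressure is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for ORACLE; the data are those of figure 15
# with the dates rewritten as YYYY-MM-DD.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pressure (id TEXT, visit_date TEXT, systolic INTEGER);
INSERT INTO pressure VALUES
  ('01', '1980-07-02', 114),
  ('01', '1981-09-12', 118),
  ('01', '1981-12-22', 120),
  ('01', '1982-01-23', 120),
  ('02', '1980-07-02', 130);
""")

# One output column per year; CASE without ELSE yields NULL, which AVG skips.
rows = conn.execute("""
SELECT id,
       AVG(CASE WHEN strftime('%Y', visit_date) = '1980' THEN systolic END),
       AVG(CASE WHEN strftime('%Y', visit_date) = '1981' THEN systolic END),
       AVG(CASE WHEN strftime('%Y', visit_date) = '1982' THEN systolic END)
FROM pressure
GROUP BY id
ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

Years without a visit come back as missing values (None in Python), which corresponds to the empty cells for patient 02 in the figure.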
The third option is the selection of the statistical procedure.
Many nested menus can be built behind this option. After filling in
the specifications for the analysis, the interface selects the
appropriate package and the appropriate procedure. Syntax will be
generated according to the package and procedure selected. When
two packages could be used, the package of preference is used.
Each researcher can specify his own preference.
The fourth option provides the user or statistical expert the
possibility to incorporate new packages and new procedures in the
interface (figure 16). First, tables are generated (with SQL)
for the model and methods of the procedure in general, independent
of packages. Subsequently, screens can be generated automatically
(FASTFORM facility) or with a form editor (SQL*FORMS). The third
step is the specification of the package possibilities for the
procedure. Some packages produce different statistics. A kappa
value can be produced in a frequency procedure by one package but
not by the other. The fourth step is the specification of syntax.
After these four steps the new procedure can be used. A lot of
checks can be incorporated in the screens to prevent the
generation of wrong syntax. Building these checks is the most
time consuming part but pays off later.
UPDATE SYNTAX BASE
- Add/modify package specifications
- Create/update tables for procedures
- Create/update screens for procedures
- Create/update package-procedure
  specifications and limitations
- Create/update package-procedure syntax
Figure 16
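Storing package syntax in the database and generating a command file from it can be pictured with a small template table. This is a hypothetical sketch, not the design of the prototype: it assumes Python's sqlite3 module in place of ORACLE, a made-up table package_syntax, a made-up procedure name crosstab, and a SAS PROC FREQ template whose placeholders are filled with Python's str.format.

```python
import sqlite3

# Assumption: SQLite stands in for ORACLE; table, procedure name and
# template layout are invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE package_syntax ("
    " package TEXT, procedure_name TEXT, template TEXT)"
)
conn.execute(
    "INSERT INTO package_syntax VALUES (?, ?, ?)",
    (
        "SAS",
        "crosstab",
        "proc freq data={dataset};\n  tables {row}*{column};\nrun;",
    ),
)

def generate_syntax(connection, package, procedure_name, **fields):
    """Look up the stored template and fill in the user's choices."""
    row = connection.execute(
        "SELECT template FROM package_syntax "
        "WHERE package = ? AND procedure_name = ?",
        (package, procedure_name),
    ).fetchone()
    if row is None:
        raise LookupError(f"no syntax stored for {package}/{procedure_name}")
    return row[0].format(**fields)

command_file = generate_syntax(
    conn, "SAS", "crosstab", dataset="parents", row="gender", column="family"
)
print(command_file)
```

Adding a package then amounts to inserting further rows for the same procedure name, so the screens that collect the user's choices need not change.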
Discussion.
This prototype is intended as a general prototype, where all SQL
based relational databases are connected to all command file driven
statistical packages. However, in practice one has to limit oneself
in order to produce a prototype at all. Oracle has been chosen as RDBMS
because it works on a large range of machines and Oracle intends
to adapt their tools to DB2 of IBM, which is also an SQL based
relational database. Since Oracle and DB2 together cover most of
the RDBMS market, we believe that this was the right approach at
the RDBMS side of the interface.
Concerning the statistical side of the interface, we tried to
make it as wide as possible. Many new statistical techniques are
not implemented yet in SAS. How long did it take until HOMALS,
CANALS and PRINCALS were incorporated? It is a question of
time, finance and market. In practice most researchers use one or
two packages together with some stand alone programs. Our
approach provides the possibility to incorporate all these statistical
facilities in one single shell. This shell can be called an expert
system, but that depends on what criteria are used for the term
expert system. A lot of expertise is incorporated in the shell.
However, no special forward chaining or backward chaining routines
are used to accomplish the task in less time.
Loading the database with statistical expertise is at the
moment the most time consuming part. It depends on what
researchers want. If they use only twenty different procedures,
these screens and syntaxes can be tailored to their wishes. Tailored
loading costs about two to six weeks, but loading of all SAS
procedures will take more time. This is also a question of market
wishes.
The prototype has now been constructed and functions quite
well. However, it is a nearly empty shell, in the sense that most of
the statistical procedures are not implemented yet. A shared effort
of a group of users would reduce the cost for filling the database
with statistical expertise.
References
1) ORCE Systems. The relational database management system.
   NGI The Netherlands / ORCE The Netherlands. 1985.
2) Perkinson RC. Data analysis, the key to database design. QED
   Information Services Inc., Wellesley MA, USA. Elsevier Science
   Publishers, Amsterdam. Second edition 1985.