INTERFACING NORMALIZED RELATIONAL DATABASE STRUCTURES WITH SAS® SOFTWARE
James R. Johnson, Glaxo Inc.
Roger D. Cornejo, Glaxo Inc.
Abstract

The use of relational database (RDBMS) technology and different
levels of normalization (1st, 2nd, 3rd, 4th normal data structures)
is proliferating throughout the data processing industry. RDBMS
systems are valued for their ability to maintain the integrity of
data, reduce unnecessary data redundancy, and provide maximum
flexibility in retrieval. At the same time SAS® software is
established as the general fourth generation language (4GL) tool for
data analysis and reporting. Clearly, the use of relational
databases and SAS® software should be included in the ideal tool
kit for systems development.

Use of the different and distinct forms of normalized data
structures, within an RDBMS, impacts the creation and use of SAS
data sets. Well designed RDBMS applications typically result in
normalized tables that remove repeating data groups and duplicated
data, and establish appropriate key-to-data associations within a
table. Many of the powerful features in the SAS System utilize data
sets that are not normalized, thus presenting several interfacing
problems. This dichotomy in each product's usage need not force a
compromise in each product's strengths. In this paper we will
discuss four different normalization forms used in relational
databases, as well as presenting methods of interfacing with three
normalized data structures using SAS® software.

It is the intent of this paper to discuss the varying degrees of a
normalized database, through a real world example, and how the use
of PROC SQL, the DATA step and SAS System procedures can effectively
be used to access and analyze information stored in these data
structures.

Introduction

Throughout the 1980's the implementation of relational theory in the
design and development of database management systems has been
evident with the introduction of such products as DEC's Rdb®, IBM's
DB2®, and Oracle Corporation's ORACLE RDBMS®. Each of these
database engines utilizes the basic relational principle that tables
of information can be defined which obey constraints, called
relations, as outlined by Date (1981). Relational database
technology offers application developers and end-users a tool that
can improve the usefulness and quality of data by organizing the
information into varying degrees of normalized tables which will
better meet the business needs of the application being developed.

One of the problems that has persisted throughout the development of
relational technology is the lack of procedural data analysis and
reporting tools for application developers and end-users to
interface with a relational database. SQL is the industry standard
data manipulation language (DML) for Selecting, Inserting, Updating,
and Deleting rows and columns of data from a relational database.
SQL is also a data definition language (DDL) for creating and
deleting database objects (tables, views, indexes, synonyms, etc.).
Additionally, SQL is a data control language (DCL) through which
user access to the database and its tables and views is granted
(Grants). SQL is not a procedural language since it does not offer
programming constructs such as logical subsetting with IF-THEN-ELSE
or CASE statements, or looping with DO-END, DO-WHILE, or DO-UNTIL
programming constructs. SAS® Software, in contrast, does have a
procedural language component in the DATA step that supports logical
and looping programming constructs. The library of SAS® Software
procedures is internationally accepted as one of the most
comprehensive tool sets of pre-packaged analysis routines available
for application developers and end-users. SAS/ACCESS® software
specifically offers an interface to relational databases, thus
providing the application developer and end-user with a tool for
taking advantage of the strengths of relational database technology
and the procedural language components of the SAS System to solve
various business needs.

Normalization

Generally speaking, normalization is a formal process of organizing
data into tables (relations) of logically related information that
satisfy conditions defined for the various normal forms (Kemm 1989).
The normalized database environment is a means by which an
organization can:

- Provide a base of all data elements relevant to the business
  requirements.
- Provide a processing environment that makes these data elements
  easily available to all appropriate users, both current and
  future.
- Ensure data integrity in all of the data elements.
- Provide a stable, reproducible, highly flexible, and standardized
  data architecture to meet the clearly defined business needs.
- Optimize the performance of the database.
In poorly designed relational database applications (non-normalized)
the developer and end-user will often find that some or all of the
following anomalies may take place:

- Unrelated data elements may be placed into tables together.
- Data elements may appear as repeated groups of information (e.g.
  un-indexed arrays of data elements).
- Data element values may be stored as null or blank for selected
  types of rows in the table.
- Referential integrity problems may exist that can result in a
  corrupt database.

It is the goal of the normalization process to reduce and in most
cases eliminate these features of poorly designed applications.
Normalization techniques were created to provide database designers
with a methodology for detecting and preventing these problems.
Normalizing an application will remove uncontrolled redundancy of
data elements, reduce the amount of structural changes to table
designs, and favor the transaction process for inserting, updating,
and deleting rows in a table.

Many developers have arrived at 1NF, 2NF, 3NF database designs just
by applying a common sense understanding of the data elements which
are to be stored in their applications. Nevertheless, there are
distinct levels of normalization and rules for how each level can be
applied to meet the business needs of the database being designed.
Table 1 lists the various states of normalization to be discussed in
this paper, as well as a simple rule for the existence of each.

Table 1. States of Normalization in a Relational Database.*

Normalization
Form            Rule for Existence
1NF             No repeating groups of data elements.
2NF             All non-key columns dependent upon the entire key.
3NF             All non-key columns dependent on the entire key and
                have no other dependencies. No non-key column is
                dependent on another non-key column.
4NF             No row in a table has more than one multivalued
                dependency.

* Other normalization states have been defined, but are not
  important to the discussion in this paper.

Non-Normalized Data Structure

A sample, single table, database that does not apply the rules of
normalization defined in Table 1 is presented in Table 2. The sample
data set for this table, as a line listing, is presented in
Appendix 1. This example will be used to show how the rules of
normalization can be applied as well as demonstrate access to the
data via PROC SQL and/or the DATA step. Access to this type of table
structure is as simple as creating a descriptor file and view member
of a catalog using SAS/ACCESS® Software. The SAS/ACCESS to Oracle
procedure was utilized for this paper. The rules and presentation
could, however, be generalized for use with access to the Rdb and
DB2 relational database environments.

Table 2. Non-Normalized Data Table.
Non-Normalized Data Table for Drug Survey Example. All information
is contained in one data table.

Name             Null?      Type
*PATIENT         NOT NULL   NUMBER
*VISIT           NOT NULL   NUMBER
VISIT_DATE       NOT NULL   DATE
SEX                         CHAR(6)
AGE                         NUMBER
AGE_UNITS                   CHAR(6)
RACE                        CHAR(1)
MARTIAL_STATUS              CHAR(7)
STUDY_DRUG                  CHAR(1)
INV_NUM          NOT NULL   NUMBER
INV_NAME                    CHAR(7)
INV_PHONE_EXT               NUMBER
BP_SYS                      NUMBER
BP_DYS                      NUMBER
HEART_RATE                  NUMBER
TEST_1                      NUMBER
TEST_2                      NUMBER
TEST_3                      NUMBER
TEST_4                      NUMBER
TEST_5                      NUMBER
* = primary key

This non-normalized table structure looks similar to the "file"
structures historically utilized to store and retrieve information
on a particular application. In the example, all repeating groups,
information regarding individual visits, and information about the
investigator are physically stored together in a single row of the
table. Also note that redundant information on data elements
collected once per patient is proliferated throughout the table.
Access to this type of non-normalized table can be accomplished with
the following simple PROC SQL, DATA step, or procedural examples.
Note that in each case the SAS statements are reading directly from
an ORACLE database table to complete the task requested. The PROC
SQL example will automatically read and display the information
requested in a single step, with no intermediate data set being
formed.

Note that the naming conventions in the programming statements are
different than the naming of columns in the respective tables.
SAS/ACCESS® Interface to Oracle software was used to create
descriptor files to the databases. Naming conventions in the access
procedures must follow traditional 8 character naming conventions
used historically in SAS software.
    libname oracledb "[]";

    proc sql;
      select * from oracledb.alldata
        where patient = 10;
    quit;

or

    proc print data=oracledb.alldata;
      where patient = 10;
    run;

The DATA step can optionally be utilized to read directly from an
ORACLE database table or view to create a traditional SAS data set.

    data test;
      set oracledb.alldata;
      if patient = 10;
    run;

Clearly, the non-normalized data structure is not desirable because
of the problems that are associated with update and maintenance of
the table (including the addition of a new test), excess and
redundant storage of information (see Appendix 1), and the poor
performance that would be seen when sorting and querying the
example. This sample data set, along with almost every type of data
storage and retrieval application, can be analyzed and stored within
a relational database management system using the normalization
techniques that follow in the discussion.

First Normal Form (1NF) Data Structure

First normal form (1NF) provides for no repeating groups of
information. This means that no array structures (e.g. A1, A2, A3,
...) should be utilized in the table design. It also means that all
repeating groups of information are moved into their own separate
table (along with the primary key columns) and values for A1, A2, A3
are stored as rows rather than columns. The primary reason for this
form is that in most cases it is difficult to predict the number of
repeating elements that may be utilized, and adding another row to a
table is trivial, while adding another column is not an elementary
task. Additionally, if we need to perform operations on the values
for A1, A2, A3, ... in the non-normalized table the operation needs
to be duplicated for each repeat group rather than coded once (e.g.
A1*10, A2*10, A3*10, ..., versus A*10).

On occasion 1NF can be violated when the number of repeating
elements is fixed either by definition or business rule (e.g. fixed
monthly columns such as JAN, FEB, MAR, ...). Denormalization may
also at times be utilized for performance reasons to create valid
1NF violations. However, very clear, defined business or performance
rules must be predefined to control the violation of 1NF database
architecture.

The normalization process will generate tables that generally favor
update processing over the query process. Sometimes this is not a
desirable situation for the application. For example, in tables
utilized for query only, that are built and maintained by a loading
utility, 1NF representation may be sufficient to support the
application. It is often desirable in query-based applications to
de-normalize the table for performance reasons (e.g. increase the
access speed to the information). This can result in an increase in
the redundancy of information in the table, albeit planned
redundancy.

Table 3 presents a representation of the sample database where the
repeating groups of test scores are moved into another table, thus
creating a two table database environment. Notice that the
TEST_SCORES table assigns the test number to an individual column
while the corresponding test result is placed in the results column.
The keys patient, visit, and test number are utilized to uniquely
identify each record in the TEST_SCORES table.

Table 3. First Normal Form (1NF)
1NF Data Tables for Drug Survey Example. Repeated groups of test
scores are partitioned out into a test scores table. All other
information resides in the patient table.

Table Name: PATIENTS

Name             Null?      Type
*PATIENT         NOT NULL   NUMBER
*VISIT           NOT NULL   NUMBER
VISIT_DATE       NOT NULL   DATE
SEX                         CHAR(6)
AGE                         NUMBER
AGE_UNITS                   CHAR(6)
RACE                        CHAR(1)
MARTIAL_STATUS              CHAR(7)
STUDY_DRUG                  CHAR(1)
INV_NUM          NOT NULL   NUMBER
INV_NAME                    CHAR(7)
INV_PHONE_EXT               NUMBER
BP_SYS                      NUMBER
BP_DYS                      NUMBER
HEART_RATE                  NUMBER
* = primary key

Table Name: TEST_SCORES

Name             Null?      Type
*PATIENT         NOT NULL   NUMBER
*VISIT           NOT NULL   NUMBER
*TEST_NUM                   NUMBER
TEST_VALUE                  NUMBER
* = primary key
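
The restructuring shown in Table 3 can be sketched outside of SAS. The block below is a minimal illustration only, using Python's built-in sqlite3 module in place of ORACLE; the table and column names are borrowed from the tables above, while the data values and the reduction to three test columns are assumptions made to keep the example short.

```python
import sqlite3

# Hypothetical miniature of the non-normalized table: one row per
# patient visit, with the test scores held in repeating columns.
# The values are made up and only three test columns are used.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE alldata (patient INTEGER, visit INTEGER, sex TEXT,"
    " test_1 REAL, test_2 REAL, test_3 REAL,"
    " PRIMARY KEY (patient, visit))"
)
conn.executemany(
    "INSERT INTO alldata VALUES (?, ?, ?, ?, ?, ?)",
    [(10, 1, "MALE", 50, 55, 45), (10, 2, "MALE", 97, 42, None)],
)

# 1NF: the repeating group moves into its own table, keyed by
# patient, visit, and test number; each score becomes a row.
conn.execute(
    "CREATE TABLE test_scores (patient INTEGER NOT NULL,"
    " visit INTEGER NOT NULL, test_num INTEGER NOT NULL,"
    " test_value REAL, PRIMARY KEY (patient, visit, test_num))"
)
for n in (1, 2, 3):
    conn.execute(
        f"INSERT INTO test_scores"
        f" SELECT patient, visit, {n}, test_{n} FROM alldata"
    )
conn.execute("CREATE TABLE patients AS SELECT patient, visit, sex FROM alldata")

# An operation on every score is now coded once (A*10) instead of
# once per column (A1*10, A2*10, A3*10).
scaled = conn.execute(
    "SELECT patient, visit, test_num, test_value * 10 FROM test_scores"
    " ORDER BY patient, visit, test_num"
).fetchall()

# Joining the two 1NF tables on the shared key reassembles the data.
joined = conn.execute(
    "SELECT p.patient, p.visit, p.sex, s.test_num, s.test_value"
    " FROM patients p, test_scores s"
    " WHERE p.patient = s.patient AND p.visit = s.visit"
    " ORDER BY p.patient, p.visit, s.test_num"
).fetchall()
print(joined)
```

Adding a fourth test in this layout means inserting rows with test_num = 4 rather than altering the table, which is the point the 1NF discussion makes about rows versus columns.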
Access to the collective set of information in the 1NF patient and
test scores tables is completed via a simple join of the tables
using the SQL join construct in the WHERE clause. The SORT procedure
and the DATA step may also be utilized to collectively evaluate the
join of the two tables in a similar fashion. However, using
traditional SAS programming constructs will create intermediate data
sets in either a permanent or work library, depending upon how they
are specified. Clearly, the use of the SQL procedure to join and
display the information in a single step provides a more robust
processing environment than the multi-step procedures and DATA step
environment.

For purposes of discussion, the ORDER BY clause is a similar
statement to the SORT procedure, and the WHERE clause completes the
same function as the match MERGE/BY statement syntax in the DATA
step. The WHERE clause can be extended to generate similar results
to a sub-setting IF statement in the DATA step.

Access to the 1NF database tables via PROC SQL is as follows:

    libname oracledb "[]";

    proc sql;
      select a.patient, b.visit, a.invnum, a.sex,
             a.race, b.testnum, b.result
        from oracledb.fnpat a, oracledb.fnscore b
        where a.patient = b.patient
          and a.visit = b.visit
        order by a.patient, b.visit;
    quit;

Access via traditional SAS programming constructs would be as
follows, assuming that the data are not stored in sorted order.

    proc sort data=oracledb.fnpat
              out=pat;
      by patient visit;
    run;

    proc sort data=oracledb.fnscore
              out=score;
      by patient visit;
    run;

    data scores;
      merge pat score;
      by patient visit;
      keep patient visit invnum sex race
           testnum result;
    run;

    proc print data=scores;
    run;

Second Normal Form (2NF) Data Structure

Second normal form (2NF) provides that all of the columns in a table
are dependent upon the entire key of the table. Any column of
information in a 2NF designed table should require all elements that
comprise the table's primary key to uniquely identify it. If the
column can be uniquely identified by only part of the key then that
column, and the relevant keys, should be separated out into another
table. More formally, a table is in 2NF if it is in 1NF and there
are no columns which are dependent on only part of the primary key.

The information regarding repeated measurements of blood pressure
and heart rate is visit dependent, and as such should be partitioned
out into an individual measurement table to meet the rules of 2NF. A
measurements table that uniquely identifies the visit number and
associated measurements collected at a particular visit is created
to eliminate the problems of redundant data seen in Appendix 1. All
data elements that are collected once and are visit independent are
placed in the patients table.

Table 4. Second Normal Form (2NF)
2NF data tables for Drug Survey Example. Information not associated
with a visit is partitioned out into a patient demog table, while
information that is visit dependent is placed in a patient measure
table. The test scores table is maintained from 1NF.

Table Name: PATIENT_DEMOG

Name             Null?      Type
*PATIENT         NOT NULL   NUMBER
SEX                         CHAR(6)
AGE                         NUMBER
AGE_UNITS                   CHAR(6)
RACE                        CHAR(1)
MARTIAL_STATUS              CHAR(7)
STUDY_DRUG                  CHAR(1)
INV_NUM          NOT NULL   NUMBER
INV_NAME                    CHAR(7)
INV_PHONE_EXT               NUMBER
* = primary key

Table Name: PATIENT_MEASURE

Name             Null?      Type
*PATIENT         NOT NULL   NUMBER
*VISIT           NOT NULL   NUMBER
VISIT_DATE       NOT NULL   DATE
BP_SYS                      NUMBER
BP_DYS                      NUMBER
HEART_RATE                  NUMBER
* = primary key

Figure 1 presents a pictorial view of how the sample database is
implemented in 2NF form. The key to this database design is in the
fact that a one-to-many relationship exists between the
patient_demog table and the patient_measure and test_scores tables.
It is the implementation of these one-to-many relationships that
serves as the foundation of good database design.

Access to the information in each of the tables is obtained using a
join, similar to the 1NF query, except that additional WHERE clause
criteria are required to link the patient measure data with patient
demography data via the patient key. The SQL statements below
present the logical join required to read selected information from
each of the tables in a 2NF implementation of the database. As with
the 1NF join, the example below will complete the join and display
the results in a single step.

    libname oracledb "[]";

    proc sql;
      select a.patient, a.sex, a.race, a.studydg, a.invname,
             b.testnum, b.result,
             c.visit, c.visitdt, c.bpsys, c.bpdys
        from oracledb.snpat a, oracledb.snscore b,
             oracledb.snrslt c
        where a.patient = b.patient
          and a.patient = c.patient
          and b.patient = c.patient
          and b.visit = c.visit
        order by a.patient, b.visit, b.testnum;
    quit;
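
The three-table join of the 2NF design can also be tried on a small scale without an ORACLE connection. The following is a rough sketch using Python's sqlite3 module as a stand-in; the table names follow Table 4, the column lists are shortened, and every data value is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shortened, hypothetical versions of the Table 4 structures.
CREATE TABLE patient_demog (
    patient INTEGER PRIMARY KEY, sex TEXT, study_drug TEXT);
CREATE TABLE patient_measure (
    patient INTEGER, visit INTEGER, bp_sys INTEGER, bp_dys INTEGER,
    PRIMARY KEY (patient, visit));
CREATE TABLE test_scores (
    patient INTEGER, visit INTEGER, test_num INTEGER, test_value REAL,
    PRIMARY KEY (patient, visit, test_num));

INSERT INTO patient_demog VALUES (10, 'MALE', 'A'), (20, 'FEMALE', 'B');
INSERT INTO patient_measure VALUES
    (10, 1, 120, 72), (10, 2, 122, 76), (20, 1, 129, 81);
INSERT INTO test_scores VALUES
    (10, 1, 1, 50), (10, 1, 2, 55), (10, 2, 1, 97), (20, 1, 1, 65);
""")

# Demographics are stored once per patient; the join repeats them
# only in the query result, not in storage.
demog_rows = conn.execute("SELECT COUNT(*) FROM patient_demog").fetchone()[0]

# Patient key links all three tables; visit links measures to scores.
rows = conn.execute("""
    SELECT a.patient, a.sex, a.study_drug, b.test_num, b.test_value,
           c.visit, c.bp_sys, c.bp_dys
      FROM patient_demog a, test_scores b, patient_measure c
     WHERE a.patient = b.patient
       AND a.patient = c.patient
       AND b.visit = c.visit
     ORDER BY a.patient, b.visit, b.test_num
""").fetchall()

for row in rows:
    print(row)
```

The result has one row per test score, with the visit measurements and patient demographics repeated alongside it, which mirrors the row layout of the original single-table design.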
To accomplish the same task as in the above SQL statements, the
following basic DATA step and procedure logic could be utilized to
complete the join of these three tables. This traditional
implementation of SAS programming steps assumes that the data are
not stored in sorted order in the tables.

    libname oracledb "[]";

    proc sort data=oracledb.snpat
              out=pat;
      by patient;
    run;

    proc sort data=oracledb.snscore
              out=score;
      by patient visit testnum;
    run;

    proc sort data=oracledb.snrslt
              out=rslts;
      by patient visit;
    run;

    /* Scores and visit measurements share the patient/visit key. */
    data visits;
      merge score (in=in1)
            rslts (in=in2);
      by patient visit;
      if in1 and in2;
    run;

    /* PAT holds one row per patient and has no VISIT column, */
    /* so it is matched on PATIENT alone.                     */
    data multipat;
      merge visits (in=in12)
            pat    (in=in3);
      by patient;
      if in12 and in3;
    run;

    proc print data=multipat;
    run;

Access to a 2NF database architecture need not be as cumbersome as
in the above example; it can be as simple as reading from the
traditional SAS data set structures utilized historically (e.g. the
single SAS data set) by creating a view (stored SQL statement). The
ability to create a view of the joined information and access the
view directly, with either the DATA step or a procedure, is a new
feature of SAS® Software that will allow for greater speed of
retrieval and much more robust access to data in a normalized
database. Now applications can utilize the procedural language
programming constructs that have been time tested over many SAS®
Software applications with views directly into a database as well as
the traditional SAS data set.

The following example creates a view of the information and
permanently stores this view in a catalog. Note that this example
creates a view of three tables of information in a manner that is
consistent with looking at the data in 1NF. Once this type of view
of the database is made available to the end-user then all of the
components of the SAS System become available to an application.

    libname oracledb "[]";

    proc sql;
      create view oracledb.sntable as
        select a.patient, a.sex, a.race, a.studydg, a.invname,
               b.testnum, b.result,
               c.visit, c.visitdt, c.bpsys, c.bpdys
          from oracledb.snpat a, oracledb.snscore b,
               oracledb.snrslt c
          where a.patient = b.patient
            and a.patient = c.patient
            and b.patient = c.patient
            and b.visit = c.visit;
    quit;

Once a view of the database is created and stored in the catalog
then the complete power of the SAS System can be utilized to access,
analyze, and report on information stored in the database as if it
were a SAS data set. For example, simple statistical procedures can
be executed directly against the database using SAS procedures. The
following example reads directly from the database view defined as
SNTABLE, created above, to complete a means and frequency
distribution computation.

    proc means n mean data=oracledb.sntable;
      by patient visit;
      var result;
    run;

    proc freq data=oracledb.sntable;
      where visit = 1;
      table studydg*sex*race;
    run;

Advantages of Views into Databases

Creating and utilizing views offers several advantages when working
with relational databases. These advantages are not limited to 2NF
form architectures and should provide benefit across all forms of
database design. Advantages include:

- Savings on disk space, as views are virtual usage of the machine.
  Views are nothing more than stored SQL statements (virtual
  tables); no disk space is used to store the data that is
  "projected/joined" by the view.
- Data will always be current in that the database is accessed
  directly each time a query is made.
- Data can be shared among more users without the creation of
  additional data sets.
- Many different views or combinations of views can be created to
  grant access to all or selected information in the database.

Third Normal Form (3NF) Data Structure

Third Normal Form (3NF) eliminates all columns of information in a
table that are uniquely identified by data items in the table other
than key data columns. All non-key columns are dependent upon the
entire key (2NF) and have no other field dependencies. This latter
type of dependency is referred to as a transitive data dependency
(Date 1981). Columns that have a transitive data dependency are
removed, along with their respective identifiers, into a separate
table. Most database designers aim to
create a 3NF database environment. The Boyce Codd Normal Form (BCNF)
is an extension of 3NF. BCNF differs from 3NF in that it applies to
candidate keys as well (Date 1981).

In the 2NF implementation of the sample database a transitive
dependency exists in the patient demographics table. The information
associated with an investigator (inv_num, inv_name, inv_phone) is
not totally dependent upon the components of the primary keys in the
2NF patients table. In fact, the investigator's information is
independent of the patient's information, but a patient cannot be
without an investigator. Therefore, the information regarding an
investigator is partitioned out into its own table, thus eliminating
the transitive dependency. The identifying key in the investigators
table (inv_num) is left as a column in the patient data table to
allow for the identification of the investigator's information
associated with that patient's information. This is referred to as a
foreign key.

The 3NF database design can be implemented as shown in Table 5.
Table 5 shows that an investigators table is created and that the
patient demographics table is altered to allow for the investigator
number to be a foreign key. The transitive dependencies of
investigator name and phone number are eliminated from the 2NF
implementation. All other 1NF and 2NF table designs are retained.

Table 5. Third Normal Form (3NF)
3NF data tables for Drug Survey Example. The information for the
investigator is partitioned out into a separate table. The test
scores table is maintained from 1NF, and the measure table is
retained from 2NF.

Table Name: INVESTIGATORS

Name             Null?      Type
*INV_NUM         NOT NULL   NUMBER
INV_NAME                    CHAR(7)
INV_PHONE_EXT               NUMBER
* = primary key

Table Name: PATIENT_DEMOG

Name             Null?      Type
*PATIENT         NOT NULL   NUMBER
SEX                         CHAR(6)
AGE                         NUMBER
AGE_UNITS                   CHAR(6)
RACE                        CHAR(1)
MARTIAL_STATUS              CHAR(7)
STUDY_DRUG                  CHAR(1)
INV_NUM          NOT NULL   NUMBER
* = primary key

Figure 2 presents a pictorial view of how the sample database is
implemented in 3NF form. The keys to this database design are in the
fact that one investigator is associated with many patients, and a
one-to-many relationship exists between the patient_demog table and
the patient_measure and test_scores tables. The projection from 2NF
to 3NF offers greater flexibility in database design in that
information is discretely grouped into logical tables of information
that are properly related.

Access to the 3NF database is the same as with the 2NF database
design. Additional WHERE clause criteria are required to address the
investigators table. The example below demonstrates the join of four
tables to create a view into the sample database in 3NF form.

    libname oracledb "[]";

    proc sql;
      create view oracledb.tntable as
        select a.patient, a.sex, a.race, a.studydg,
               b.testnum, b.result,
               c.visit, c.visitdt, c.bpsys, c.bpdys,
               d.invname
          from oracledb.tnpat a, oracledb.tnscore b,
               oracledb.tnrslt c, oracledb.tninv d
          where a.patient = b.patient
            and a.patient = c.patient
            and b.patient = c.patient
            and b.visit = c.visit
            and a.invnum = d.invnum;
    quit;

Access to a 3NF implementation of the database using traditional SAS
programming constructs (DATA and PROC steps) would be similar to the
2NF database implementation. Additional sorting and merging would be
required to match merge the investigator with the patient. The
traditional SAS programming steps to match merge these tables of
information are not presented because, as a database is fully
normalized, the benefits of using views become obvious with the
saving of machine and programming resources.

Fourth Normal Form (4NF)

Fourth normal form (4NF) provides for the decomposition of a
relation into two or more projections (e.g. subset tables) if the
relation has multi-valued dependencies that are not functional
dependencies (Date 1981, Watterson 1989). Simply stated, if a table
has more than one column (or sets of columns) which are
independently dependent upon the key then these column(s) and key
should be partitioned out into a separate set of tables. It is not
the intent of this paper to discuss the architecture, access, or
retrieval from this form of normalized database in the examples
presented. This definition is presented to demonstrate that the
relational model can be extended beyond the traditional 3NF form
that is normally used by database designers.
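
Although 4NF is not part of the sample database, the definition above can be made concrete with a small sketch. The block below uses Python's sqlite3 module; the PATIENT_FACTS table, its ALLERGY and PRIOR_DRUG columns, and all values are hypothetical and do not come from the drug survey example. It assumes a patient's allergies and prior drugs are independent of each other, which is the situation the definition describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table: two independent multi-valued facts about a
-- patient force every allergy to be paired with every prior drug.
CREATE TABLE patient_facts (patient INTEGER, allergy TEXT, prior_drug TEXT);
INSERT INTO patient_facts VALUES
    (10, 'PENICILLIN', 'ASPIRIN'), (10, 'PENICILLIN', 'CODEINE'),
    (10, 'LATEX',      'ASPIRIN'), (10, 'LATEX',      'CODEINE');

-- 4NF decomposition: one table per independent multi-valued fact,
-- each carrying the key.
CREATE TABLE patient_allergy AS
    SELECT DISTINCT patient, allergy FROM patient_facts;
CREATE TABLE patient_prior_drug AS
    SELECT DISTINCT patient, prior_drug FROM patient_facts;
""")

original = conn.execute(
    "SELECT patient, allergy, prior_drug FROM patient_facts"
    " ORDER BY 1, 2, 3"
).fetchall()

# Joining the two projections on the key rebuilds the original rows,
# so nothing is lost by storing the facts separately.
rejoined = conn.execute("""
    SELECT a.patient, a.allergy, d.prior_drug
      FROM patient_allergy a, patient_prior_drug d
     WHERE a.patient = d.patient
     ORDER BY 1, 2, 3
""").fetchall()

allergy_rows = conn.execute("SELECT COUNT(*) FROM patient_allergy").fetchone()[0]
drug_rows = conn.execute("SELECT COUNT(*) FROM patient_prior_drug").fetchone()[0]
print(original == rejoined, allergy_rows, drug_rows)
```

In this sketch the combined table needs four rows to hold two allergies and two prior drugs, while the decomposed tables need two rows each; recording a third allergy adds one row instead of two.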
Transposing a Normalized Data Structure

One of the strengths of a normalized relational database outlined
thus far is the elimination of repeating groups (e.g. arrayed
structures). This aspect of normalization is carried from the very
first step in applying normalization techniques to database design.
Figure 1 clearly shows that the TEST_SCORES table provides for three
keys to fully describe the individual rows in the table (patient,
visit, and test_num), thus eliminating the array constructions.
Often programmers want to present or analyze information structured
with repeating groups utilizing the strengths of array processing.
An application can still take advantage of these array processing
features by creating a discrete view that transposes the
information. In the sample database a view of the TEST_SCORES table
is created that transforms the individual test numbers into columns.
For example, the following syntax creates a permanent view into the
TEST_SCORES table that is the join of the table on itself once for
each of the test numbers (1 through 5).

    proc sql;
      create view oracledb.tranpose as
        select t1.patient, t1.visit,
               t1.result as test_1,
               t2.result as test_2,
               t3.result as test_3,
               t4.result as test_4,
               t5.result as test_5
          from oracledb.snscore t1,
               oracledb.snscore t2,
               oracledb.snscore t3,
               oracledb.snscore t4,
               oracledb.snscore t5
          where t1.patient = t2.patient
            and t1.patient = t3.patient
            and t1.patient = t4.patient
            and t1.patient = t5.patient
            and t1.visit = t2.visit
            and t1.visit = t3.visit
            and t1.visit = t4.visit
            and t1.visit = t5.visit
            and t1.testnum = 1
            and t2.testnum = 2
            and t3.testnum = 3
            and t4.testnum = 4
            and t5.testnum = 5;
    quit;

Access to this type of view can utilize the ARRAY statement syntax
supported in the DATA step or the VAR statement in procedures, as
demonstrated in the following example.

    data _null_;
      set oracledb.tranpose;
      array tests{5} test_1 - test_5;
      do j = 1 to 5;
        if tests{j} = . then
          put "Missing Test Score" patient= +1 visit=;
      end;
    run;

    proc print data=oracledb.tranpose;
      var patient visit test_2 test_4;
    run;

    proc means data=oracledb.tranpose;
      var test_1 test_3 test_5;
    run;

Summary

In this paper a discussion has been presented on the use of
different degrees of normalization using a relational database and
the interface with SAS® Software's procedural language components.
The use of a relational database can improve the quality and
accessibility of information requirements in many different
applications. By applying normalization techniques, application
developers and end-users alike can take advantage of the strengths
of storing data in the various normal forms (1NF, 2NF, 3NF, or 4NF).
These strengths include, but are not limited to, the elimination of
redundant data, broader access, and improved utilization of disk
space. With the introduction of SAS/ACCESS® software the
application developer and end-user now have the ability to fully
integrate the strengths of SAS programming constructs (DATA step,
Macro facility, and statistical procedures) with a relational
database. The combination of these two application development
software packages provides a tool set for systems developed in the
1990's.

Acknowledgements

The authors wish to thank Ms. Nancy Wheeler and Mr. Robin Pasley for
their contributions to and review of this paper.

References

Date C.J. 1981. An Introduction to Database Systems. Addison-Wesley
Publishing, Reading, MA. 574 pages.

Kemm T. 1989. A Practical Guide to Normalization. DBMS, December
1989, pp. 46-52.

Watterson K. 1989. From chaos to order. Data Based Advisor, Feb.
1989, 7(2):33-37.

Contact Information

James R. Johnson
Roger D. Cornejo
Glaxo, Inc.
MIS - Scientific Computing
Five Moore Drive
Research Triangle Park, North Carolina 27709
Phone: (919) 248-7341
FAX: (919) 248-2571

SAS, SAS Software, and SAS/ACCESS Software are registered trademarks
of SAS Institute Inc., Cary, North Carolina USA.
DB2 is a registered trademark of IBM Corporation.
Rdb is a registered trademark of Digital Equipment Corporation.
ORACLE RDBMS is a registered trademark of Oracle Corporation,
Belmont, California USA.
Figure 1. Physical implementation of a 2NF form database for the example.
[Diagram: PATIENT_DEMOG, PATIENT_MEASURE, and TEST_SCORES tables with their primary keys.]

Figure 2. Physical implementation of 3NF database tables for sample application.
[Diagram: PATIENT_DEMOG, PATIENT_MEASURE, TEST_SCORES, and INVESTIGATORS tables with their primary keys.]

APPENDIX 1.
Sample Data for Normalization Examples.
Sample Drug/Response Survey Information
Patient Visit Visit Date Sex
10
10
10
10
10
1
2
01-JAN-89
05-JAN-89
01-FEB-89
15-FES-89
01-MAR-89
50
50
50
50
50
1
4
5
1
2
YEARS C
YEARS C
YEARS C
YEARS C
"""=
SINGLE
2.
SINGLE
SINGLE
26
21
snmLl!!
20
20
..
"
"
40
50
50
55
55
45
97
42
56
55
96
99
41
99
98
99
100
100
99
100
A
A
25
B
B
A
A
A
MARRIED
65
60
78
63
60
30
35
'tEARS 8
77
65
60
37
19
27
YEARS B
MARRIED
79
69
MAAR"'D
70
39
42
26
25
B
YEARS B
61
63
35
35
35
35
35
YEARS C
45
44
43
30
B
59
100
99
97
61
100
YEARS C
SINGLE
SINGLE
SINGLE
SINGLE
SINGLE
01-FEB-89
10-FE8-89
26-FEB-89
25-MAR-89
16-APR-89
FEMALE 45
FEKALE 45
FEMALE 45
FEMALE 45
FEMALE 45
YEARS B
MARRIED
8
YEARS B
YEARS 8
YEARS B
MARRIED
15-FEB-89
01-MAR-89
07-MAR-89
MALE
MALE
'tlUIRS C
YEARS C
MALE
55
55
55
15-MAR-89
MALE
55
YEARS C
26-HAR-89
MALE
55
'tlUIRS C
SINGLE
SINGLE
SINGLE
SINGLE
SINGLE
26-JAN-89
10-FE8-89
25-FEB-89
05-MAR-89
26-MAR-89
40
40
40
40
40
MALE
YEARS C
FEMALR 25
25
25
25
25
FEMALE
FEMALE
FEMALE
FEMALE
MALE
MALE
MALE
MALE
MALE
YEARS C
YEARS c
YEARS C
YEARS
YEARS
C
MARRIED
MA!\RIED
MARRIED
'7
58
57
66
33
53
63
31
32
"
.
B
B
B
B
100
A
A
90
65
67
67
100
91
100
A
84
100
100
99
"
35
2'
71
27
B
100
61
65
66
61
B
67
.
so
5
10
12
83
n
19
16
B5
91
B5
N_
Phone
Ext.
SyS/Dys
. .c_
01
01
01
01
01
Smith
Smith
Smith
Smith
smith
1223
1223
1223
1223
1223
120/12
122/76
134/79
126/12
135/79
75
05
05
05
05
05
DeMarco
DeMarco
DeMarco
DeMarco
DeMarcQ
8901
8901
8901
8901
8901
129/81
127/11
125/11
121/72
128/80
"
"
01
01
01
01
01
Smith
Smith
smith
Smith
smith
1223
1223
1223
7223
1223
135/68
135/69
.2
134/77
80
10
10
10
10
10
Yourdon
Yourdon
Y ourdon
Yourdon
Yourdon
5512
5512
5512
5512
5512
161/80
165/72
166/82
165/88
178/90
"82
05
05
05
05
05
DeMarco
DeMarco
DeMarco
DeMarco
DeMarco
8901
8901
8901
8901
8901
192/99
188/86
115/82
166/19
126/12
75
74
72
74
Study Inv.
Drug NUIII.
"""""'D
MAAR"'D
30
30
30
30
30
5
MALE
MALE
15
15
15
15
15
os
YEARS B
20
4
MALE
----- Test Scores -----
YEARS B
15-JAN-89
01-FEB-89
15-FEB-89
15-MAR-89
15-APR-89
20
20
20
20
MALE
Age
Martial
Age Units Race status
A
A
B
B
B
B
B
Inv.
BP
Seart
.."
99
97
66
67
75
80
./.
./.
85
84
84
65