RELATIONAL DATABASE DESIGN THEORY:
AN OVERVIEW
Faith Renee Sloan,
Parke-Davis Pharmaceutical Research Division, Ann Arbor, Michigan

ABSTRACT

With the addition of Structured Query Language (SQL), a database language that has become the American National Standards Institute (ANSI) standard for relational languages, in Version 6 SAS software, it is [...] the efficient querying of the same by our users. In this paper some design issues regarding relational databases will be discussed.

RELATIONAL THEORY AND DESIGN

Relational theory was developed in the late 1960s and published in 1970 by Edgar F. Codd, then an IBM researcher. This theory assists in the design of relational databases and in the efficient processing of user requests for information from a database.

The following four basic premises can be derived from Codd's 12 established rules of relational theory:

1. TABLES are used to store information in a database.
2. The tables can be related to one another in different manners to represent various data relationships and views by using RELATIONAL [...]
3. NORMALIZATION provides [...] relational database design.
4. [...]

A relational database consists of a collection of tables, each of which is assigned a unique name. Each table has rows and columns. A row in a table represents [...] correspondence between the concept of table and the mathematical concept of relation, from which the relational data model takes its name.

[...] clauses. The result of a SQL query is a relation. Following is a simple query using the sample table found in Figure 1: "Display the trial, patient number, and EKG abnormality of all patients with an EKG abnormality code that is between 100 and 200 inclusive."

   proc sql;
      select trial, ptno, ekgabn
         from ekgabnor
         where ekgabn between 100 and 200;

The result of this SQL query is the following relation:

   TRIAL   PTNO   EKGABN
   1       2      105
   1       2      102

In order to describe the structure of a database, we need to define the concept of a data model. A data model is a collection of conceptual tools for describing data, data relationships, and consistency constraints. Any system's file structure is a model of data that pertains to the part of the real world relevant to the application.

Designers and developers can reap many benefits from good data modeling. It makes code more straightforward, [...] It also gives the system greater durability in the face of inevitable changes, allowing for a longer useful lifetime.

A poorly designed data structure, however, causes trouble early in the development process, and it keeps creating problems for as long as the installed system is maintained. At best, it makes code hard to understand, difficult to maintain, and inefficient to run. At worst, we may get well into the development process before realizing that the system is incapable of producing some required output elements.

ENTITY-RELATIONSHIP MODEL

One of the primary modeling techniques is that of the Entity-Relationship (E-R) model [...] built around the identification, analysis, description, and relationships of real-world functions, processes, and data entities of the business.

The E-R model is a model comprised of the following three types of constructs:

1. entities and attributes (data items)
2. entity relationships
3. attribute constraints or domains

[...] Examples of entity types are PATIENT, VITAL SIGNS, MONITOR CONTACT, MEDICATION, THERAPEUTIC AREA, ADVERSE EVENT, DRUG INVENTORY LOCATION, DELIVERY SCHEDULE, etc. An individual entity will become a row in a relational table.

An entity set is a set of entities of the same type. This will become a table of a relational database. For example, the set of all patients with EKG abnormalities is defined as the entity set EKGABNOR in Figure 1. [...] ABCODES, for example.

Attributes are the smallest units of data in the data model. These will become the columns in the database tables. An entity is represented by a set of attributes. Attributes of the EKGABNOR entity are trial, patient number, visit, and EKG abnormality code. Possible attributes of the ABCODES entity could be EKG abnormality code and EKG abnormality description.

A relationship is an association among several entities. Each entity or table may be related to any or all of the other tables in any number of ways. For example, we may define a relationship set VITEKGAB to denote the association between the table in Figure 2, VITALS (patients' vital signs), and the table in Figure 1, EKGABNOR (patients' EKG abnormalities). The association is depicted in Figure 3. The SAS SQL code to generate this association or join is as follows:

   proc sql;
      select *
         from vitals v, ekgabnor e
         where v.protocol=e.protocol
           and v.trial=e.trial
           and v.ptno=e.ptno
           and v.visit=e.visit;

For each data element or attribute there is a set of allowed values, called the domain of that attribute. [...] NOT NULL (NN), NUMERIC (N), CHARACTER (C), PRIMARY KEY (PK), etc.

The effectiveness of the E-R approach in defining structures is constrained by the following three factors:

1. entity identification
2. entity definition
3. entity relationships identification and definition

[...] context of the organization. It is whatever the business defines it to be, and that definition must make sense within the context of the organization.

Second, it must be determined which attributes of each of the identified entities are needed by the organization and why. Make a tentative list of all the attributes that might conceivably be relevant to each of the identified entities. Review the list carefully.

When the right attributes are matched up with the right entities, figure out how each attribute might best be represented as one or more columns in the table associated with its entity. For example, we need to carefully consider the primary key for each entity, in that we need to identify one that will uniquely identify each row of the entity (for example, patient number concatenated with protocol and trial).

And finally, the analyst or designer must identify and define the relationships which exist between the identified and defined entities and their relative importance to the organization, noting the primary and secondary keys that may be necessary to represent these relations. These relationships can be difficult to define since the relationships are dynamic, depending upon the situational requirements.

The above is an iterative process. Deciding what entities and attributes are relevant to our application is a very important and creative aspect of system analysis and design. It helps a lot to know the business or industry in which you are working, and it helps even more to have a broad understanding about how your potential users think.

DATABASE NORMALIZATION

The relational model provides a number of powerful analytic tools that offer invaluable assistance in designing and optimizing the data structures for our applications. The tools are called the normal forms, and the process of applying them is called normalization.

Normalization is a formal, systematic procedure for testing your tentative data structure against a series [...] model. It ensures that the data model is in the most stable (i.e., least likely to change) form. If we use the rules of normalization to create data structures, we are likely to save a lot of work in the development process.

Four important properties are associated with a table that represents a relation in a database:

1. No two rows can be identical.
2. The ordering of the rows is insignificant.
3. The ordering of the columns is insignificant.
4. Each data item should be atomic (i.e., nondecomposable, such as an integer or character string).

The normalization process consists of sequential steps for refining the data model. A design which has been refined through the first step is said to be in first normal form. If the second step is completed, the design is in the second normal form. The first, second, and third normal forms yield substantial practical benefits.

Complying with the first normal form requires elimination of repeating columns or groups of columns from each table. All columns in a table must be related to the primary key of that table in a one-to-one manner. In Figure 4 the primary key of the DRUG table is PROTOCOL. The table is in violation of the first normal form rule since each protocol may have more than one investigator. PROTOCOL and DRUG are repeated for PROTOCOL 999.

As illustrated in Figure 4, violating the first normal form rule may cause data redundancy, since entire rows must be repeated to allow for multiple values of the incorrectly placed column; inability to completely identify what the incorrectly placed column represents; and maintenance difficulties (i.e., if adding a new investigator means adding a new row, all the protocol's attributes (e.g., drug name) must be completely reentered). It becomes apparent that investigator name is not an attribute of a drug and thus INVESTNAME should not be a column in the DRUG table.

Figure 5 shows the DRUG table restructured to correct for the first normal form violation by deleting the INVESTNAME column. Of course, an investigator entity would have to be created with its appropriate attributes.

The second degree of normalization involves tables that have composite primary keys (PK) (i.e., primary keys that are constructed by concatenation of two or more secondary keys). Violating the second normal form rule may cause: data redundancy due to one or more columns containing frequently repeating data values; maintenance difficulties; and erroneously combining an entity table and a relation table into a single table.

As illustrated in Figure 6, the DRUGNAME is an attribute of the PROTOCOL rather than of the TRIAL within the PROTOCOL. Thus it is dependent only upon PROTOCOL rather than upon the concatenation of both PROTOCOL and TRIAL. But the STATUS column denotes the status of the TRIAL within the PROTOCOL for the study. It is dependent upon the entire primary key and it is legitimate in this table. See Figure 7 for a possible solution to this problem.

The third form of normalization involves eliminating the dependence of any non-key field upon any other fields except the primary key. No two non-primary key columns in a table may be related in a one-to-one manner (transitive dependence). Violating the third normal form will cause the same symptoms as when violating the first and second normal forms.

As illustrated in Figure 8, the STUDYTEAM table, the problem here is that we have two non-key fields dependent upon each other rather than upon the primary key field, EMPLOYEE-CODE. TEAM and LEADER-CODE are transitively dependent upon one another. Making a change to either of these two fields without simultaneously making the appropriate change to the other may introduce anomalies into our database.

One possible solution is to restructure the table into multiple tables as in Figure 9.

We could go on normalizing to the nth form. But further normalizing is impractical and unnecessary. For more detailed normalization research, I would recommend that the reader refer to 'A Relational Model of Data for Large Shared Data Banks' by Edgar F. Codd, published in Communications of the ACM, Vol. 13 No. 6, June 1970; 'Relational Databases: The Real Story' by Steven J. Vaughan-Nichols in BYTE, Vol. 15 No. 14, December 1990; and 'Interfacing Normalized Relational Database Structures With SAS Software' by James R. Johnson and Roger O. Cornejo of Glaxo, published in SAS Users Group International 15th Annual Conference Proceedings (pp. 421-427).

CONCLUSION

In a relational database, even the most complicated data relationships can be reduced to two-dimensional table [...] than it is under other data models. [...]

But normalization is not a panacea. It will not validate the selection of entities in the data model, detect redundant entities, or validate entity relationships. What practical normalization will do is minimize data maintenance anomalies and minimize data redundancies. [...] the potential for data maintenance anomalies and multiple entities which were erroneously combined into one entity in the data model (third normal form violation).

In general, the goal of a relational database design is to generate a set of relation schemes that allow us to [...] of SQL, end-users would now be able to access data in an ad hoc way.

Acknowledgements

I wish to thank Gail Scherer and Iraj Mohebalian for their technical review of this paper.

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA.
ORACLE is a registered trademark of Oracle Corporation.

EKGABNOR TABLE

PROTOCOL   TRIAL     PTNO      VISIT     EKGABN
N,NN,PK    N,NN,PK   N,NN,PK   N,NN,PK   N
999        1         1         3         230
888        1         2         3         105
999        1         2         3         102
999        2         1         4         609

Figure 1
Sample EKG Abnormality Table
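
The simple query described in the text ("EKG abnormality code between 100 and 200 inclusive") can be tried against this table outside SAS. What follows is only a minimal sketch in Python using the standard-library sqlite3 module, not the paper's PROC SQL step. The in-memory database and the helper name build_ekgabnor are assumptions made for illustration; the table name, column names, constraint codes, and rows come from Figure 1.

```python
import sqlite3


def build_ekgabnor(conn):
    """Create EKGABNOR and load the Figure 1 rows (hypothetical helper)."""
    # NN and PK in the figure map onto NOT NULL and a composite PRIMARY KEY.
    conn.execute(
        """CREATE TABLE ekgabnor (
               protocol INTEGER NOT NULL,
               trial    INTEGER NOT NULL,
               ptno     INTEGER NOT NULL,
               visit    INTEGER NOT NULL,
               ekgabn   INTEGER,
               PRIMARY KEY (protocol, trial, ptno, visit)
           )"""
    )
    conn.executemany(
        "INSERT INTO ekgabnor VALUES (?, ?, ?, ?, ?)",
        [
            (999, 1, 1, 3, 230),
            (888, 1, 2, 3, 105),
            (999, 1, 2, 3, 102),
            (999, 2, 1, 4, 609),
        ],
    )


conn = sqlite3.connect(":memory:")
build_ekgabnor(conn)

# BETWEEN is inclusive at both ends, as the request in the text requires.
rows = conn.execute(
    "SELECT trial, ptno, ekgabn FROM ekgabnor "
    "WHERE ekgabn BETWEEN 100 AND 200 ORDER BY ekgabn"
).fetchall()
print(rows)
```

The ORDER BY clause is added only to make the row order predictable; a relation itself has no inherent row order.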
VITALS TABLE

PROTOCOL   TRIAL     PTNO      VISIT     OBP    SBP   HRT   WGT
N,NN,PK    N,NN,PK   N,NN,PK   N,NN,PK   N,PK   N     N     N
999        1         1         0         82     142   72    200
999        1         1         3         88     132   80    195
999        1         2         3         96     138   68    155

Figure 2
VITALS Entity Set or Table
VITEKGAB

PROTOCOL   TRIAL     PTNO      VISIT     OBP   SBP   HRT   WGT   EKGABN
N,NN,PK    N,NN,PK   N,NN,PK   N,NN,PK   N     N     N     N     N
999        1         1         3         88    132   80    195   230
999        1         2         3         96    138   68    155   102

Figure 3
Relationship between VITALS and EKGABNOR tables
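
The association in Figure 3 can be reproduced by joining the VITALS and EKGABNOR tables on the four key columns they share. The sketch below is written in Python with the standard-library sqlite3 module rather than SAS, and it assumes an in-memory database. Table names, column names, and rows are taken from Figures 1 and 2; the explicit column list in the SELECT is a choice made here so that each key column appears only once in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vitals (
        protocol INTEGER NOT NULL, trial INTEGER NOT NULL,
        ptno INTEGER NOT NULL, visit INTEGER NOT NULL,
        obp INTEGER, sbp INTEGER, hrt INTEGER, wgt INTEGER,
        PRIMARY KEY (protocol, trial, ptno, visit)
    );
    CREATE TABLE ekgabnor (
        protocol INTEGER NOT NULL, trial INTEGER NOT NULL,
        ptno INTEGER NOT NULL, visit INTEGER NOT NULL,
        ekgabn INTEGER,
        PRIMARY KEY (protocol, trial, ptno, visit)
    );
    INSERT INTO vitals VALUES
        (999, 1, 1, 0, 82, 142, 72, 200),
        (999, 1, 1, 3, 88, 132, 80, 195),
        (999, 1, 2, 3, 96, 138, 68, 155);
    INSERT INTO ekgabnor VALUES
        (999, 1, 1, 3, 230),
        (888, 1, 2, 3, 105),
        (999, 1, 2, 3, 102),
        (999, 2, 1, 4, 609);
    """
)

# Only rows whose protocol, trial, patient number, and visit all match
# in both tables survive the join.
vitekgab = conn.execute(
    """
    SELECT v.protocol, v.trial, v.ptno, v.visit,
           v.obp, v.sbp, v.hrt, v.wgt, e.ekgabn
      FROM vitals v, ekgabnor e
     WHERE v.protocol = e.protocol
       AND v.trial = e.trial
       AND v.ptno = e.ptno
       AND v.visit = e.visit
     ORDER BY v.ptno
    """
).fetchall()
for row in vitekgab:
    print(row)
```

The visit 0 row of VITALS and the protocol 888 and trial 2 rows of EKGABNOR have no partner in the other table, which is why only two rows appear.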
DRUG TABLE

PROTOCOL   DRUGNAME   INVESTNAME
N,NN,PK    C          C
888        Primon     Doe
999        Bacrinol   Stein
999        Bacrinol   Johnson
999        Bacrinol   Miller

Figure 4
Violation of First Normal Form Rule
DRUG TABLE

PROTOCOL   DRUGNAME
N,NN,PK    C
888        Primon
999        Bacrinol

Figure 5
Correction of Violation of First Normal Form Rule
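
The restructuring from Figure 4 to Figure 5 can be illustrated with a small sketch in Python and the standard-library sqlite3 module. The paper says only that an investigator entity would have to be created with its appropriate attributes. The INVESTIGATOR table below, with its two columns and composite key, is therefore an assumption made for illustration and not the author's design. The DRUG table and all data values come from Figures 4 and 5.

```python
import sqlite3

# Rows of the Figure 4 table, which repeats the drug name per investigator.
figure4 = [
    (888, "Primon", "Doe"),
    (999, "Bacrinol", "Stein"),
    (999, "Bacrinol", "Johnson"),
    (999, "Bacrinol", "Miller"),
]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE drug (
        protocol INTEGER NOT NULL PRIMARY KEY,
        drugname TEXT
    );
    -- Hypothetical investigator table: one row per protocol/investigator pair.
    CREATE TABLE investigator (
        protocol   INTEGER NOT NULL REFERENCES drug(protocol),
        investname TEXT NOT NULL,
        PRIMARY KEY (protocol, investname)
    );
    """
)

for protocol, drugname, investname in figure4:
    # Each drug name is stored once per protocol; repeats are skipped.
    conn.execute("INSERT OR IGNORE INTO drug VALUES (?, ?)", (protocol, drugname))
    conn.execute("INSERT INTO investigator VALUES (?, ?)", (protocol, investname))

drug_rows = conn.execute(
    "SELECT protocol, drugname FROM drug ORDER BY protocol"
).fetchall()

# Joining the two tables recovers every row of Figure 4 without loss.
rejoined = conn.execute(
    """
    SELECT d.protocol, d.drugname, i.investname
      FROM drug d JOIN investigator i ON i.protocol = d.protocol
    """
).fetchall()
print(drug_rows)
print(sorted(rejoined) == sorted(figure4))
```

Adding a new investigator to protocol 999 now means one new row in the investigator table, with no need to reenter the drug name.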
TRIAL TABLE

PROTOCOL   TRIAL     DRUGNAME   STATUS
N,NN,PK    N,NN,PK   C          C
888        1         Primon     CONT
888        2         Primon     WITHDR
999        1         Bacrinol   CONT
999        2         Bacrinol   CONT

Figure 6
Violation of Second Normal Form Rule
TRIAL TABLE

PROTOCOL   TRIAL     STATUS
N,NN,PK    N,NN,PK   C
888        1         CONT
888        2         WITHDR
999        1         CONT
999        2         CONT

Figure 7
Correction of Violation of Second Normal Form Rule
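
The reasoning behind Figures 6 and 7 is that DRUGNAME depends on only part of the composite key while STATUS depends on the whole key. A rough way to probe such dependencies in sample data is sketched below in plain Python. The helper name determines is hypothetical. A check like this can only refute a dependency from the rows at hand; it cannot prove that the dependency holds for the business in general.

```python
def determines(rows, lhs, rhs):
    """Return True if, in these rows, equal values of the `lhs` columns
    always come with the same value of the `rhs` column (hypothetical helper)."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Rows of the Figure 6 TRIAL table.
figure6 = [
    {"PROTOCOL": 888, "TRIAL": 1, "DRUGNAME": "Primon", "STATUS": "CONT"},
    {"PROTOCOL": 888, "TRIAL": 2, "DRUGNAME": "Primon", "STATUS": "WITHDR"},
    {"PROTOCOL": 999, "TRIAL": 1, "DRUGNAME": "Bacrinol", "STATUS": "CONT"},
    {"PROTOCOL": 999, "TRIAL": 2, "DRUGNAME": "Bacrinol", "STATUS": "CONT"},
]

# DRUGNAME is fixed by PROTOCOL alone: a partial dependency on the key.
drug_by_protocol = determines(figure6, ["PROTOCOL"], "DRUGNAME")
# STATUS is not fixed by either half of the key, only by the full key.
status_by_protocol = determines(figure6, ["PROTOCOL"], "STATUS")
status_by_trial = determines(figure6, ["TRIAL"], "STATUS")
status_by_key = determines(figure6, ["PROTOCOL", "TRIAL"], "STATUS")

# Decomposition: STATUS stays with the full key (Figure 7), while
# DRUGNAME moves to a table keyed by PROTOCOL alone (as in Figure 5).
trial_rows = [(r["PROTOCOL"], r["TRIAL"], r["STATUS"]) for r in figure6]
drug_rows = sorted({(r["PROTOCOL"], r["DRUGNAME"]) for r in figure6})

print(drug_by_protocol, status_by_protocol, status_by_trial, status_by_key)
print(drug_rows)
```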
STUDY TABLE

EMPLOYEE-CODE   TEAM    LEADER-CODE
C,NN,PK         C,NN    C
426111817       A888    359111111
357888222       B888    426111817
111224444       Z921    431445555
999999999       Z999    888998888

Figure 8
Violation of Third Normal Form Rule
STUDY TABLE

STUDY-CODE   STUDY-NAME   LEADER-CODE
C,NN,PK      C            C
111          Tacrine1     426111817
222          Pirmenol1    222334444
333          Tacrine2     888998888
444          Quinapril1   345678901

STUDY-TEAM TABLE

EMPLOYEE-CODE   STUDY-CODE
C,NN,PK         C,NN,FK
426111817       111
357888222       111
111224444       333
999999999       444

Figure 9
Correction of Violation of Third Normal Form Rule
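
With the two tables of Figure 9, the leader of a study is recorded in exactly one row, so changing it cannot leave team members with inconsistent leader codes. The sketch below uses Python and the standard-library sqlite3 module to illustrate this. The underscore column names (SQL identifiers cannot contain hyphens unquoted), the helper name leaders, and the particular leader change applied are assumptions made for illustration. The table contents come from Figure 9.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE study (
        study_code  TEXT NOT NULL PRIMARY KEY,
        study_name  TEXT,
        leader_code TEXT
    );
    CREATE TABLE study_team (
        employee_code TEXT NOT NULL PRIMARY KEY,
        study_code    TEXT NOT NULL REFERENCES study(study_code)
    );
    INSERT INTO study VALUES
        ('111', 'Tacrine1',   '426111817'),
        ('222', 'Pirmenol1',  '222334444'),
        ('333', 'Tacrine2',   '888998888'),
        ('444', 'Quinapril1', '345678901');
    INSERT INTO study_team VALUES
        ('426111817', '111'),
        ('357888222', '111'),
        ('111224444', '333'),
        ('999999999', '444');
    """
)


def leaders(conn):
    """Look up each employee's leader through the study (hypothetical helper)."""
    return conn.execute(
        """
        SELECT t.employee_code, s.leader_code
          FROM study_team t JOIN study s ON s.study_code = t.study_code
         ORDER BY t.employee_code
        """
    ).fetchall()


before = leaders(conn)

# Hypothetical change: study 111 gets a new leader. Only one row is touched.
cursor = conn.execute(
    "UPDATE study SET leader_code = ? WHERE study_code = ?",
    ("357888222", "111"),
)
rows_changed = cursor.rowcount
after = leaders(conn)

print(rows_changed)
print(after)
```

Both employees assigned to study 111 see the new leader after the single update, which is the behavior the Figure 8 layout could not guarantee.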