RELATIONAL DATABASE DESIGN THEORY: AN OVERVIEW

Faith Renee Sloan, Parke-Davis Pharmaceutical Research Division, Ann Arbor, Michigan

ABSTRACT

With the addition of Structured Query Language (SQL), a database language that has become the American National Standards Institute (ANSI) standard for relational language, in Version 6 SAS software, it is [...] the efficient querying of the same by our users. In this paper some design issues regarding relational databases will be discussed.

RELATIONAL THEORY AND DESIGN

Relational theory was developed in the late 1960s and published in 1970 by Edgar F. Codd, then an IBM researcher. This theory assists in the design of relational databases and in the efficient processing of user requests for information from a database.

The following four basic premises can be derived from Codd's 12 established rules of relational theory:

1. Tables are used to store information in a database.
2. The tables can be related to one another in different manners to represent various data relationships and views by using relational [...].
3. Normalization provides [...] relational database design.
4. [...]

A relational database consists of a collection of tables, each of which is assigned a unique name. Each table has rows and columns. A row in a table represents a relationship among a set of values [...] correspondence between the concept of table and the mathematical concept of relation, from which the relational data model takes its name.

[...] clauses. The result of a SQL query is a relation.

Following is a simple query using the sample table found in Figure 1: "Display the trial, patient number, and EKG abnormality of all patients with an EKG abnormality code that is between 100 and 200 inclusive."

    proc sql;
    select trial, ptno, ekgabn
    from ekgabnor
    where ekgabn between 100 and 200;

The result of this SQL query is the following relation:

    TRIAL  PTNO  EKGABN
    1      2     105
    1      2     102

In order to describe the structure of a database, we need to define the concept of a data model. A data model is a collection of conceptual tools for describing data, data relationships, and consistency constraints. Any system's file structure is a model of data that pertains to the part of the real world relevant to the application.

Designers and developers can reap many benefits from good data modeling. It makes code more straightforward, [...]. It also gives the system greater durability in the face of inevitable changes, allowing for a longer useful lifetime.

A poorly designed data structure, however, causes trouble early in the development process, and it keeps creating problems for as long as the installed system is maintained. At best, it makes code hard to understand, difficult to maintain, and inefficient to run. At worst, we may get well into the development process before realizing that the system is incapable of producing some required output elements.

ENTITY-RELATIONSHIP MODEL

One of the primary modeling techniques is that of the entity-relationship (E-R) model, [...] built around the identification, analysis, description, and relationships of real-world functions, processes, and data entities of the business. The E-R model is a model comprised of the following three types of constructs:

1. entities and attributes (data items)
2. entity relationships
3. attribute constraints or domain

[...] Examples of entity types are PATIENT, VITAL SIGNS, MONITOR CONTACT, MEDICATION, THERAPEUTIC AREA, ADVERSE EVENT, DRUG INVENTORY LOCATION, DELIVERY SCHEDULE, etc. An individual entity will become a row in a relational table.

An entity set is a set of entities of the same type. This will become a table of a relational database. For example, the set of all patients with EKG abnormalities is defined as the entity set EKGABNOR in Figure 1. [...] ABCODES, for example.

Attributes are the smallest units of data in the data model. These will become the columns in the database tables. An entity is represented by a set of attributes. Attributes of the EKGABNOR entity are trial, patient number, visit, and EKG abnormality code. Possible attributes of the ABCODES entity could be EKG abnormality code and EKG abnormality description.

A relationship is an association among several entities. Each entity or table may be related to any or all of the other tables in any number of ways. For example, we may define a relationship set VITEKGAB to denote the association between the table in Figure 2, VITALS (patients' vital signs), and the table in Figure 1, EKGABNOR (patients' EKG abnormalities). The association is depicted in Figure 3. The SAS SQL code to generate this association or join is as follows:

    proc sql;
    select *
    from vitals v, ekgabnor e
    where v.protocol=e.protocol
      and v.trial=e.trial
      and v.ptno=e.ptno
      and v.visit=e.visit;

For each data element or attribute there is a set of allowed values, called the domain of that attribute. Attribute constraints and types, abbreviated as in the figures, include NOT NULL (NN), NUMERIC (N), CHARACTER (C), PRIMARY KEY (PK), etc.

The effectiveness of the E-R structures is constrained by [...] approach in defining the following three factors:

1. entity identification
2. entity definition
3. entity relationships identification and definition

[...] context of the organization. It is whatever the business defines it to be, and that definition must make sense within the context of the organization.

Second, it must be determined which attributes of each of the identified entities are needed by the organization and why. Make a tentative list of all the attributes that might conceivably be relevant to each of the identified entities. Review the list carefully.

When the right attributes are matched up with the right entities, figure out how each attribute might best be represented as one or more columns in the table associated with its entity. For example, we need to carefully consider the primary key for each entity, in that we need to identify one that will uniquely identify each row of the entity (for example, patient number concatenated with protocol and trial).

And finally, the analyst or designer must identify and define the relationships which exist between the identified and defined entities and their relative importance to the organization, noting the primary and secondary keys that may be necessary to represent these relations. These relationships can be difficult to define since they are dynamic, depending upon the situational requirements.

The above is an iterative process. Deciding what entities and attributes are relevant to our application is a very important and creative aspect of systems analysis and design. It helps a lot to know the business or industry in which you are working, and it helps even more to have a broad understanding of how your potential users think.

DATABASE NORMALIZATION

The relational model provides a number of powerful analytic tools that offer invaluable assistance in designing and optimizing the data structures for our applications. The tools are called the normal forms, and the process of applying them is called normalization.

Normalization is a formal, systematic procedure for testing your tentative data structure against a series [...] data model. It ensures that the data model is in the most stable (i.e., least likely to change) form. If we use the rules of normalization to create data structures, we are likely to save a lot of work in the development process.

Four important properties are associated with a table that represents a relation in a database:

1. No two rows can be identical.
2. The ordering of the rows is insignificant.
3. The ordering of the columns is insignificant.
4. Each data item should be atomic (i.e., nondecomposable, such as an integer or character string).

But normalization is not a panacea. It will not validate the selection of entities in the data model, detect redundant entities, or validate entity relationships. What practical normalization will do is minimize data maintenance anomalies; minimize data redundancies; [...] the potential for data maintenance anomalies and multiple entities which were erroneously combined into one entity in the data model (third normal form violation).

The normalization process consists of sequential steps for refining the data model. A design which has been refined through the first step is said to be in first normal form. If the second step is completed, the design is in the second normal form. The first, second, and third normal forms yield substantial practical benefits.

In general, the goal of a relational database design is to generate a set of relation schemes that allow us to [...].

Complying with the first normal form requires elimination of repeating columns or groups of columns from each table. All columns in a table must be related to the primary key of that table in a one-to-one manner. In Figure 4 the primary key of the DRUG table is PROTOCOL. The table is in violation of the first normal form rule since each protocol may have more than one investigator. The PROTOCOL and DRUGNAME values are repeated for PROTOCOL 999.

As illustrated in Figure 4, violating the first normal form rule may cause: data redundancy, since entire rows must be repeated to allow for multiple values of the incorrectly placed column; inability to completely identify what the incorrectly placed column represents; and maintenance difficulties (i.e., if adding a new investigator means adding a new row, all the protocol's attributes, e.g., drug name, must be completely reentered). It becomes apparent that investigator name is not an attribute of a drug and thus INVESTNAME should not be a column in the DRUG table.

Figure 5 shows the DRUG table restructured to correct for the first normal form violation by deleting the INVESTNAME column. Of course, an investigator entity would have to be created with its appropriate attributes.

The second degree of normalization involves tables that have composite primary keys (PK) (i.e., primary keys that are constructed by concatenation of two or more secondary keys). Violating the second normal form rule may cause: data redundancy due to one or more columns containing frequently repeating data values; maintenance difficulties; and erroneously combining an entity table and a relation table into a single table.

As illustrated in Figure 6, DRUGNAME is an attribute of the PROTOCOL rather than of the TRIAL within the PROTOCOL. Thus it is dependent only upon PROTOCOL rather than upon the concatenation of both PROTOCOL and TRIAL. But the STATUS column denotes the status of the TRIAL within the PROTOCOL for the study. It is dependent upon the entire primary key and it is legitimate in this table. See Figure 7 for a possible solution to this problem.

The third form of normalization involves eliminating the dependence of any non-key field upon any other fields except the primary key. No two non-primary key columns in a table may be related in a one-to-one manner (transitive dependence). Violating the third normal form will cause the same symptoms as violating the first and second normal forms.

As illustrated in Figure 8, the problem here is that we have two non-key fields dependent upon each other rather than upon the primary key field, EMPLOYEE-CODE. TEAM and LEADER-CODE are transitively dependent upon one another. Making a change to either of these two fields without simultaneously making the appropriate change to the other may introduce anomalies into our database. One possible solution is to restructure the table into multiple tables as in Figure 9.

We could go on normalizing to the nth form. But further normalizing is impractical and unnecessary. For more detailed normalization research, I would recommend that the reader refer to "A Relational Model of Data for Large Shared Data Banks" by Edgar F. Codd, published in Communications of the ACM, Vol. 13, No. 6, June 1970; "Relational Databases: The Real Story" by Steven J. Vaughan-Nichols in BYTE, Vol. 15, No. 14, December 1990; and "Interfacing Normalized Relational Database Structures With SAS Software" by James R. Johnson and Roger O. Cornejo of Glaxo, published in SAS Users Group International 15th Annual Conference proceedings (pp. 421-427).

CONCLUSION

In a relational database, even the most complicated data relationships can be reduced to two-dimensional table [...] easier than it is under other data models. [...] of SQL, end-users would now be able to access data in an ad hoc way.

Acknowledgements

I wish to thank Gail Scherer and Iraj Mohebalian for their technical review of this paper.

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA. ORACLE is a registered trademark of Oracle Corporation.
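The two queries discussed in the text (the BETWEEN selection on EKGABNOR and the VITALS/EKGABNOR join) can be tried without a SAS installation. The following is only a sketch under stated assumptions: it uses Python's standard sqlite3 module as a stand-in for PROC SQL, the lower-case table and column names are taken from Figures 1 and 2, and the column types are a guess at what the N, NN, and PK codes in the figures would mean in SQLite. The join lists its columns explicitly instead of using select * so that the key columns are not returned twice.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for SAS PROC SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ekgabnor (
    protocol INTEGER NOT NULL, trial INTEGER NOT NULL,
    ptno INTEGER NOT NULL, visit INTEGER NOT NULL,
    ekgabn INTEGER,
    PRIMARY KEY (protocol, trial, ptno, visit));
CREATE TABLE vitals (
    protocol INTEGER NOT NULL, trial INTEGER NOT NULL,
    ptno INTEGER NOT NULL, visit INTEGER NOT NULL,
    obp INTEGER, sbp INTEGER, hrt INTEGER, wgt INTEGER,
    PRIMARY KEY (protocol, trial, ptno, visit));
""")

# Sample rows copied from Figure 1 (EKGABNOR) and Figure 2 (VITALS).
conn.executemany("INSERT INTO ekgabnor VALUES (?,?,?,?,?)", [
    (999, 1, 1, 3, 230),
    (888, 1, 2, 3, 105),
    (999, 1, 2, 3, 102),
    (999, 2, 1, 4, 609),
])
conn.executemany("INSERT INTO vitals VALUES (?,?,?,?,?,?,?,?)", [
    (999, 1, 1, 0, 82, 142, 72, 200),
    (999, 1, 1, 3, 88, 132, 80, 195),
    (999, 1, 2, 3, 96, 138, 68, 155),
])

# Simple query: EKG abnormality codes between 100 and 200 inclusive.
simple = conn.execute("""
    SELECT trial, ptno, ekgabn
    FROM ekgabnor
    WHERE ekgabn BETWEEN 100 AND 200
""").fetchall()

# Join on the full composite key, as in the VITEKGAB relationship (Figure 3).
joined = conn.execute("""
    SELECT v.protocol, v.trial, v.ptno, v.visit,
           v.obp, v.sbp, v.hrt, v.wgt, e.ekgabn
    FROM vitals v, ekgabnor e
    WHERE v.protocol = e.protocol
      AND v.trial = e.trial
      AND v.ptno = e.ptno
      AND v.visit = e.visit
""").fetchall()
conn.close()

for row in sorted(simple):
    print(row)
for row in sorted(joined):
    print(row)
```

Because a relation has no inherent row order, the sketch sorts the rows before printing. The join returns only the visits that appear in both tables, which is why the visit 0 vitals row and the protocol 888 EKG row do not appear in Figure 3.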
Figure 1. EKGABNOR TABLE: Sample EKG Abnormality Table

    PROTOCOL  TRIAL    PTNO     VISIT    EKGABN
    N,NN,PK   N,NN,PK  N,NN,PK  N,NN,PK  N
    999       1        1        3        230
    888       1        2        3        105
    999       1        2        3        102
    999       2        1        4        609

Figure 2. VITALS TABLE: VITALS Entity Set or Table

    PROTOCOL  TRIAL    PTNO     VISIT    OBP   SBP  HRT  WGT
    N,NN,PK   N,NN,PK  N,NN,PK  N,NN,PK  N,PK  N    N    N
    999       1        1        0        82    142  72   200
    999       1        1        3        88    132  80   195
    999       1        2        3        96    138  68   155

Figure 3. VITEKGAB: Relationship between VITALS and EKGABNOR tables

    PROTOCOL  TRIAL    PTNO     VISIT    OBP  SBP  HRT  WGT  EKGABN
    N,NN,PK   N,NN,PK  N,NN,PK  N,NN,PK  N    N    N    N    N
    999       1        1        3        88   132  80   195  230
    999       1        2        3        96   138  68   155  102

Figure 4. DRUG TABLE: Violation of First Normal Form Rule

    PROTOCOL  DRUGNAME  INVESTNAME
    N,NN,PK   C         C
    888       Primon    Doe
    999       Bacrinol  Stein
    999       Bacrinol  Johnson
    999       Bacrinol  Miller

Figure 5. DRUG TABLE: Correction of Violation of First Normal Form Rule

    PROTOCOL  DRUGNAME
    N,NN,PK   C
    888       Primon
    999       Bacrinol

Figure 6. TRIAL TABLE: Violation of Second Normal Form Rule

    PROTOCOL  TRIAL    DRUGNAME  STATUS
    N,NN,PK   N,NN,PK  C         C
    888       1        Primon    CONT
    888       2        Primon    WITHDR
    999       1        Bacrinol  CONT
    999       2        Bacrinol  CONT

Figure 7. TRIAL TABLE: Correction of Violation of Second Normal Form Rule

    PROTOCOL  TRIAL    STATUS
    N,NN,PK   N,NN,PK  C
    888       1        CONT
    888       2        WITHDR
    999       1        CONT
    999       2        CONT

Figure 8. STUDY TABLE: Violation of Third Normal Form Rule

    EMPLOYEE-CODE  TEAM  LEADER-CODE
    C,NN,PK        C,NN  C
    426111817      A888  359111111
    357888222      B888  426111817
    111224444      Z921  431445555
    999999999      Z999  888998888

Figure 9. STUDY TABLE and STUDY-TEAM TABLE: Correction of Violation of Third Normal Form Rule

    STUDY TABLE
    STUDY-CODE  STUDY-NAME  LEADER-CODE
    C,NN,PK     C           C
    111         Tacrine1    426111817
    222         Pirmenol1   222334444
    333         Tacrine2    888998888
    444         Quinapril1  345678901

    STUDY-TEAM TABLE
    EMPLOYEE-CODE  STUDY-CODE
    C,NN,PK        C,NN,FK
    426111817      111
    357888222      111
    111224444      333
    999999999      444
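The second normal form problem in Figure 6 can also be checked mechanically. The sketch below is an illustration, not part of the original paper: it is written in Python, and the helper names determines and project are hypothetical. It tests whether one set of columns determines another in the sample rows, then splits the Figure 6 table into the shapes of Figures 5 and 7 and joins them back together. A dependency that holds in four sample rows is only evidence, not proof, that it holds in the business rules.

```python
# Rows copied from Figure 6 (TRIAL table violating second normal form).
trial_rows = [
    {"PROTOCOL": 888, "TRIAL": 1, "DRUGNAME": "Primon", "STATUS": "CONT"},
    {"PROTOCOL": 888, "TRIAL": 2, "DRUGNAME": "Primon", "STATUS": "WITHDR"},
    {"PROTOCOL": 999, "TRIAL": 1, "DRUGNAME": "Bacrinol", "STATUS": "CONT"},
    {"PROTOCOL": 999, "TRIAL": 2, "DRUGNAME": "Bacrinol", "STATUS": "CONT"},
]


def determines(rows, lhs, rhs):
    """Return True if equal values in the lhs columns always go with
    equal values in the rhs columns (hypothetical helper)."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        # setdefault stores the first value seen for this key and
        # returns it on later visits, so a mismatch exposes a conflict.
        if seen.setdefault(key, val) != val:
            return False
    return True


def project(rows, cols):
    """Keep only the named columns and drop duplicate rows
    (hypothetical helper)."""
    out = []
    for row in rows:
        item = {c: row[c] for c in cols}
        if item not in out:
            out.append(item)
    return out


# DRUGNAME depends on PROTOCOL alone, i.e. on part of the composite key.
partial = determines(trial_rows, ["PROTOCOL"], ["DRUGNAME"])
# STATUS is not determined by PROTOCOL alone; it needs the whole key.
status_needs_full_key = (
    not determines(trial_rows, ["PROTOCOL"], ["STATUS"])
    and determines(trial_rows, ["PROTOCOL", "TRIAL"], ["STATUS"])
)

# Decompose into the shapes of Figure 5 (DRUG) and Figure 7 (TRIAL).
drug = project(trial_rows, ["PROTOCOL", "DRUGNAME"])
trial = project(trial_rows, ["PROTOCOL", "TRIAL", "STATUS"])

# Joining the two smaller tables on PROTOCOL rebuilds the original rows.
rejoined = [
    dict(t, DRUGNAME=d["DRUGNAME"])
    for t in trial
    for d in drug
    if d["PROTOCOL"] == t["PROTOCOL"]
]

print(partial, status_needs_full_key, rejoined == trial_rows)
```

The final comparison shows that nothing is lost by the split: each drug name is stored once per protocol, and the original table can still be recovered with a join.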