Flawless Logical to Physical Data Model
Transformations
Bert Scalzo, PhD.
Database Domain Expert
Quest Software
[email protected]
Data Modeling has been around for more than 35 years, and yet data modeling issues
remain one of the top reasons why many database implementations fail. Sometimes
they fail from a total lack of having been modeled at all (or modeled only as an
afterthought), and just as often from having been modeled incorrectly. It's this latter
scenario this paper will address. We'll review the top 10 most common modeling issues
faced when transforming data models from logical (i.e. business requirements) to
physical (i.e. implementation).
One key thesis to remember is that data modeling software does not relieve users of the
need to know data modeling theory and good form. Very much like word processing
software, today's data modeling tools simply assist with the task at hand. If you don't
understand the English language, sentence construction, paragraph construction, word
meaning and numerous other linguistic issues, then a word processor will simply help you
write gibberish much more quickly. Sure, it may offer some tools such as a spell checker,
but all that would do is permit sentences such as this:
The buoy eight the mouse for brake fast.
Now some more advanced word processors may also offer basic grammar checking
capabilities that would flag the prior example as a "sentence fragment" (i.e. it lacks a
verb). One could correct that and yet still have a meaningless sentence such as this:
The buoy ate the mouse for brake fast.
Microsoft Word 2003 does not flag this sentence at all, yet we all see just how bad it is.
The same is true with data modeling – the software can perform some basic checks, but it
cannot guarantee that the model makes sense. And that failure can occur along two dimensions:
effectiveness (i.e. business requirements) and efficiency (i.e. service level agreements).
So we cannot become overly reliant on data modeling tools to prevent mistakes, lest we
simply crank out bad databases faster.
One of the key areas where mistakes are commonly made is during the logical to physical
data model transformation process. So let’s examine the ten most common issues for this
critical process.
1. Many to Many Relationships
It’s not uncommon during the logical modeling process to encounter many to many
relationships such as this:
So why does this happen? Quite often it's because the business people being interviewed
assume that we all know the business equally well. Therefore it's not uncommon for them
to omit critical business facts as superfluous information that everyone simply knows.
But we cannot implement many to many relationships in a relational database! You can't
have foreign keys in both CUSTOMER and PRODUCT pointing to each other. In other
words, you cannot be a parent and a child at the same time. Thus many data modeling
tools will automatically attempt to resolve this as follows:
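In DDL terms, that automatic resolution typically introduces an intersection (associative)
table carrying a foreign key to each original table. Here is a minimal sketch, using
hypothetical CUSTOMER and PRODUCT columns invented for illustration:

    -- Minimal sketch of an automatically generated intersection table
    -- (all column names are hypothetical).
    CREATE TABLE customer (
        customer_id   NUMBER        NOT NULL,
        customer_name VARCHAR2(100) NOT NULL,
        CONSTRAINT customer_pk PRIMARY KEY (customer_id)
    );

    CREATE TABLE product (
        product_id   NUMBER        NOT NULL,
        product_name VARCHAR2(100) NOT NULL,
        CONSTRAINT product_pk PRIMARY KEY (product_id)
    );

    -- The intersection table turns one many-to-many relationship
    -- into two one-to-many relationships.
    CREATE TABLE customer_product (
        customer_id NUMBER NOT NULL,
        product_id  NUMBER NOT NULL,
        CONSTRAINT customer_product_pk PRIMARY KEY (customer_id, product_id),
        CONSTRAINT cp_customer_fk FOREIGN KEY (customer_id)
            REFERENCES customer (customer_id),
        CONSTRAINT cp_product_fk FOREIGN KEY (product_id)
            REFERENCES product (product_id)
    );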
But does this really reflect genuine business realities? If we just keep asking lots of dumb
questions, we may well find that there are additional entities and attributes that result in a
more complex and more accurate data model such as this:
Failure to ask such questions and obtain those additional details will result in a failure to
meet the fundamental business requirements (i.e. effectiveness). And there will be no
way to optimize or tune your way out of this trap.
2. Avoid Partial Unique Keys
Many data modeling tools will blindly permit the DBA to make questionable changes in
the physical model – under the assumption that the human is the expert who knows best.
One such example is allowing partial unique keys – which can sometimes surprise people
with how the database handles such a construct.
Unique identifiers are all candidates to be the primary unique identifier, and as such must
meet all the same criteria. Yet it’s not uncommon to see people try to make the alternate
unique identifiers somehow less robust – for example by permitting optional portions of
the value such as shown below.
This is, simply put, logically wrong – and some data modeling tools' check reports will
catch such mistakes. But let's assume we leave it in place and build a physical model from
this example, which would result in the following physical design:
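In DDL terms, a minimal sketch of such a construct might look like the following (the
table and column names are assumptions, invented for illustration):

    -- A composite alternate unique key in which one column was left optional.
    CREATE TABLE entity1 (
        entity1_id NUMBER       NOT NULL,
        attr_a     VARCHAR2(30) NOT NULL,
        attr_b     VARCHAR2(30),            -- optional, yet part of the unique key
        CONSTRAINT entity1_pk  PRIMARY KEY (entity1_id),
        CONSTRAINT entity1_ak1 UNIQUE (attr_a, attr_b)
    );
    -- How rows with a NULL in attr_b are treated by this unique constraint
    -- varies by database platform, which is exactly the surprise discussed next.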
So how does your database platform handle this? You may very well be quite surprised.
Oracle, for example, does not enforce multi-column constraints on rows where some of the
columns contain null values! That means Oracle will permit duplicate rows even though
you have a live unique key constraint. And duplicate rows generally mean compromised
business requirements (i.e. effectiveness). And we cannot tune or optimize our way out of
plain wrong.
3. Avoid Candidate Key Loss
Many data modeling tools will blindly permit the DBA to make questionable changes in
the physical model – under the assumption that the human is the expert who knows best.
One such example is allowing candidate key loss. And quite often this technique is too
liberally applied during the tuning and optimization process for seemingly good reason.
Below we have a logical model where business requirements dictate four different ways
to uniquely identify an occurrence of the data. So we have a primary unique identifier and
three alternates. Seems simple enough – right?
Most data modeling tools will create an initial physical design such as shown here:
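As a rough DDL sketch (the column names are hypothetical), such a design carries the
primary key plus three alternate unique constraints, each enforced by its own unique index:

    -- One primary key plus three alternate (candidate) keys.
    CREATE TABLE entity1 (
        entity1_id NUMBER       NOT NULL,
        alt_key_1  VARCHAR2(30) NOT NULL,
        alt_key_2  VARCHAR2(30) NOT NULL,
        alt_key_3  VARCHAR2(30) NOT NULL,
        CONSTRAINT entity1_pk  PRIMARY KEY (entity1_id),
        CONSTRAINT entity1_ak1 UNIQUE (alt_key_1),  -- enforced via an index such as IX2
        CONSTRAINT entity1_ak2 UNIQUE (alt_key_2),  -- enforced via an index such as IX3
        CONSTRAINT entity1_ak3 UNIQUE (alt_key_3)   -- enforced via an index such as IX4
    );
    -- Dropping those unique indexes saves DML IO, but it also drops the business
    -- rule that each alternate key value must be unique.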
So what can happen to compromise this? Well, it's not uncommon for DBAs to look for
indexes to drop as part of the tuning and optimization exercise – since every live index
basically doubles the IO necessary for a DML operation. But if we drop indexes IX2, IX3
and IX4 to improve DML IO efficiency, we'll totally compromise effectiveness.
4. Avoid Surrogate Key Loss
Many data modeling tools will blindly permit the DBA to make questionable changes in
the physical model – under the assumption that the human is the expert who knows best.
Hmmm, that’s the third time I’ve stated this – so it seems like a pattern is emerging 
One such example is allowing surrogate key loss. And quite often this technique is too
liberally applied during the tuning and optimization process for seemingly good reason.
Below we have a logical model where business requirements dictate that the MOVIE
RENTAL entity’s data is uniquely identified by its relationship to both its parent entities
(i.e. dependent relationships). This is a fairly common occurrence in logical models.
Most data modeling tools will create an initial physical design such as shown here:
It’s not uncommon for the DBA to replace expensive composite primary keys such as the
above with a simpler and more efficient surrogate (i.e. artificial) key such as shown here:
So what’s wrong with this? Well we do have a unique primary key, but look – we now no
longer enforce the business requirement that any given Rental ID, Customer ID, and
Movie ID are unique. The correct model would have been as follows:
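A minimal DDL sketch of that corrected model (the exact column names are assumptions,
and the parent CUSTOMER and MOVIE tables are presumed to already exist):

    -- Surrogate primary key plus an alternate key (AK1) that preserves
    -- the original composite natural key.
    CREATE TABLE movie_rental (
        movie_rental_id NUMBER NOT NULL,     -- surrogate (artificial) key
        rental_id       NUMBER NOT NULL,
        customer_id     NUMBER NOT NULL,
        movie_id        NUMBER NOT NULL,
        CONSTRAINT movie_rental_pk  PRIMARY KEY (movie_rental_id),
        CONSTRAINT movie_rental_ak1 UNIQUE (rental_id, customer_id, movie_id),
        CONSTRAINT mr_customer_fk FOREIGN KEY (customer_id)
            REFERENCES customer (customer_id),
        CONSTRAINT mr_movie_fk FOREIGN KEY (movie_id)
            REFERENCES movie (movie_id)
    );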
But that means two indexes instead of just one. So what can happen to compromise this?
Well, it's not uncommon for DBAs to look for indexes to drop as part of the tuning and
optimization exercise – since every live index basically doubles the IO necessary for a
DML operation. But if we drop the AK1 index to improve DML IO efficiency, we'll
totally compromise effectiveness.
5. Avoid Partial Foreign Keys
This is probably the single most frustrating issue – since it seems to happen all the time.
Referential Integrity requires that for each row of a child table, the value in the foreign
key exactly matches the value in the parent's primary or unique key. Note that it does not
say partially matches – the requirement is a complete match.
Note that ENTITY1 has a composite primary unique identifier. So when converted to a
correct physical model, we get the following:
But some data modeling tools will permit the DBA to make the following change:
Look what happened – we made a portion of the composite foreign key optional!
Remember the earlier issue with partial unique keys? Well, for some database platforms
this is the same problem – partial constraints are not enforced. So for Oracle, rows whose
foreign key columns contain a mix of null and not null values are simply not checked.
That means bogus data.
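A minimal DDL sketch of the problem (again with hypothetical names):

    -- A composite foreign key in which one column was made optional.
    CREATE TABLE entity1 (
        key_part_1 NUMBER NOT NULL,
        key_part_2 NUMBER NOT NULL,
        CONSTRAINT entity1_pk PRIMARY KEY (key_part_1, key_part_2)
    );

    CREATE TABLE entity2 (
        entity2_id NUMBER NOT NULL,
        key_part_1 NUMBER NOT NULL,
        key_part_2 NUMBER,                   -- optional: the partial foreign key
        CONSTRAINT entity2_pk PRIMARY KEY (entity2_id),
        CONSTRAINT entity2_fk FOREIGN KEY (key_part_1, key_part_2)
            REFERENCES entity1 (key_part_1, key_part_2)
    );
    -- With the default matching rule, a child row whose key_part_2 is NULL
    -- passes the constraint without being checked against ENTITY1 at all.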
Yet developers ask for this all the time – and DBAs seem to capitulate far too easily.
6. Avoid Indirect Foreign Keys
Sometimes a seemingly innocuous and harmless design change can easily backfire.
Here we have a simple logical model with three entities – where there is no business
requirement to relate ENTITY1 to ENTITY3.
Most data modeling tools will create an initial physical design such as shown here:
However, sometimes developers will need to code a SELECT statement where they use
attributes from ENTITY1 and ENTITY3 – but not from ENTITY2. So they may ask a
seemingly reasonable question – why do I have to do a three way join for data from just
two tables? In other words, why can't we treat the physical model and database as if the
following were true – that ATTRIBUTE1 can point to both parents?
Well if the developer simply codes it – no real damage done. But if the DBA actually
implements the additional foreign key for RELATIONSHIP3, then we’ve traded off
DML performance – because we’ll now have an additional and very expensive check
performed to verify that ENTITY3 data points to legitimate data for both parents. I
designate this as very expensive because we're essentially performing an extra check
per DML operation that, by the very nature of the inherited foreign key columns, will
always pass! If it will always pass – why do it?
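To illustrate, here is a hedged sketch of the query the developer can already write with no
extra foreign key in place, since ENTITY3 inherits ATTRIBUTE1 down through ENTITY2
(any column names beyond ATTRIBUTE1 are invented):

    -- ENTITY3 already carries ATTRIBUTE1, so ENTITY1 and ENTITY3 can be
    -- joined directly without touching ENTITY2 and without RELATIONSHIP3.
    SELECT e1.some_entity1_column,
           e3.some_entity3_column
    FROM   entity1 e1
           JOIN entity3 e3
             ON e3.attribute1 = e1.attribute1;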
7. Avoid Bogus Foreign Keys
This is probably the second most frustrating issue – since it too seems to happen all the
time. It’s especially an issue when working with legacy system data. But it’s just plain
simply wrong.
Referential Integrity requires that for each row of a child table, the value in the foreign
key exactly matches the value in the parent's primary or unique key. Note that it does not
say that foreign keys may relate back to non-key columns in the parent.
Some data modeling tools will permit the following design:
Where the relationship’s foreign key is defined like this:
We’ve made two mistakes here. First, we’ve dropped the properly inherited column
ATTRIBUTE1, which was the legitimate foreign key. Second, we've made another
ENTITY2 attribute point back to a non-key column in ENTITY1.
This is just so wrong on so many levels that it’s tough to discuss rationally. Yet I see this
mistake nearly everywhere and every time. And people argue vehemently to be permitted
to do this. I just don't get it. I'm not arguing for academic perfection for its own sake. I
simply cannot understand what relating one table to another via a non-key column even means.
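For what it's worth, the database itself usually refuses the construct, since a foreign key
must reference a primary key or unique constraint. A minimal sketch (hypothetical names)
that Oracle rejects outright:

    CREATE TABLE entity1 (
        entity1_id NUMBER       NOT NULL,
        attribute2 VARCHAR2(30) NOT NULL,    -- NOT a primary or unique key
        CONSTRAINT entity1_pk PRIMARY KEY (entity1_id)
    );

    CREATE TABLE entity2 (
        entity2_id NUMBER       NOT NULL,
        attribute2 VARCHAR2(30) NOT NULL,
        CONSTRAINT entity2_pk PRIMARY KEY (entity2_id),
        -- Oracle raises ORA-02270 ("no matching unique or primary key for this
        -- column-list") because ATTRIBUTE2 is not a key in ENTITY1.
        CONSTRAINT entity2_bogus_fk FOREIGN KEY (attribute2)
            REFERENCES entity1 (attribute2)
    );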
8. Problematic Relationships
Just as with people and their relationships, data models can sometimes bring together
relationships that appear reasonable – but that simply cannot work. Fortunately most of
these patterns are fairly easy to spot. We’ll examine the more common ones.
Here’s the first example:
Seems reasonable at first, doesn’t it? Is this not a simple employee to manager type of
relationship? Yes, but we’ve made a critical recursive mistake – there is no bottom. I
don’t manage anyone at Quest, so there is no one below me.
The second example is simply the opposite:
Now we have no top. Most CEOs do not report to anyone – except maybe their spouse.
And let's not forget my personal favorite – the age-old question of which came first:
Don’t laugh. Such circular logic can occur more easily than one would think. I’ve seen
people work around this last scenario by using deferrable referential integrity constraints
(i.e. the constraints on each side do not fire until a COMMIT occurs).
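Here is a minimal sketch of that workaround (the chicken-and-egg table and column names
are invented for illustration): both foreign keys are declared deferrable, so the circular
references are validated only when the transaction commits:

    -- Mutually referencing tables with deferrable foreign keys.
    CREATE TABLE chicken (
        chicken_id NUMBER NOT NULL,
        egg_id     NUMBER NOT NULL,
        CONSTRAINT chicken_pk PRIMARY KEY (chicken_id)
    );

    CREATE TABLE egg (
        egg_id     NUMBER NOT NULL,
        chicken_id NUMBER NOT NULL,
        CONSTRAINT egg_pk PRIMARY KEY (egg_id)
    );

    ALTER TABLE chicken ADD CONSTRAINT chicken_egg_fk
        FOREIGN KEY (egg_id) REFERENCES egg (egg_id)
        DEFERRABLE INITIALLY DEFERRED;

    ALTER TABLE egg ADD CONSTRAINT egg_chicken_fk
        FOREIGN KEY (chicken_id) REFERENCES chicken (chicken_id)
        DEFERRABLE INITIALLY DEFERRED;

    INSERT INTO chicken (chicken_id, egg_id) VALUES (1, 1);
    INSERT INTO egg     (egg_id, chicken_id) VALUES (1, 1);
    COMMIT;    -- both constraints fire here, and both pass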
Finally there’s the question of Unification (and that’s not some crazy church headed by
the reverend sun moon). Here we have complex logical model with multiple dependent
relationships being combined at yet a deeper level:
Most data modeling tools will create an initial physical design such as shown here:
So when we translate this into a physical model – how do we handle ENTITY4? Will
there be just one occurrence of the inherited ATTRIBUTE1 or two? When in doubt,
always ask the business. Don’t simply make a “gut call” decision on what seems like
such a slam dunk, no brainer. Never let the tail (technology) wag the dog (business).
9. Using Normal Forms
One of my favorite movie lines is from Mel Brooks' "Young Frankenstein", when the
good doctor asks Igor whose brain he stole – and he replies "Abby" something. That's
when the doctor learns that he's put an abnormal brain into his creation.
Why did I bring that up? Because I've yet to see a data model whose designers claim
it's in third normal form actually stand up to even a cursory inspection. I'm not preaching
that normalization is always appropriate or what level is best. I'm merely stating that if
one claims to be in a given state, it should in fact be true.
Thus normalization is another key area of data modeling where some confusion or lack of
complete understanding can compromise either efficiency or effectiveness. Let's review
some normalization facts and examples.
Normalization Quick Facts:
• Goal: minimize data redundancy in order to lessen the likelihood of inconsistent data
• Side effect of reducing data storage needs
• But is this important given today's cheap disk?
• Useful primarily in OLTP and ODS database designs
• Normal forms are cumulative (e.g. 2NF includes 1NF)
• Easy jingle to remember: "Depends on the key, the whole key, and nothing but the key – so help me Codd"
Here are the three most common normal forms and examples of their violations.
1NF – Attributes are all single valued, there are no repeating groups or arrays.
2NF – All non-identifying attributes are dependent on the entity's entire unique identifier
(this only applies when there are compound unique identifiers).
3NF – A non-identifying attribute cannot be dependent upon another non-identifying
attribute.
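As one small illustration (a hypothetical example, not taken from the figures above),
storing a department name alongside the department ID on an employee record violates
3NF, because the name depends on the department ID rather than on the employee's
unique identifier:

    -- 3NF violation: DEPARTMENT_NAME depends on DEPARTMENT_ID,
    -- a non-identifying attribute, rather than on EMPLOYEE_ID.
    CREATE TABLE employee (
        employee_id     NUMBER        NOT NULL,
        employee_name   VARCHAR2(100) NOT NULL,
        department_id   NUMBER        NOT NULL,
        department_name VARCHAR2(100) NOT NULL,   -- transitively dependent
        CONSTRAINT employee_pk PRIMARY KEY (employee_id)
    );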
10. Super and Sub Types
The final issue is really just a matter of choice – and knowing what the choices are, as
well as their pros and cons.
Super and Sub Types (i.e. inheritance) are a very powerful logical modeling construct.
Most business users are more than capable of assisting with construction of the logical
data model – because it’s really nothing more than a static picture of their data at rest. As
one accounting manager said to me once during a business user interview, “I know my
own business whether it’s painting on cave walls or this format.”
So let’s take a very simple example – bank accounts can be either savings or checking
accounts (but not both at the very same time) – yet all accounts share some common
characteristics or attributes. Here’s a simple example of this:
So how does one physically implement this construct in the physical data model – and
why would one choose one way versus another? Now this is one place where the DBA
should choose a direction and make a critical design implementation choice. Let’s now
examine the three options:
Option #1 - One Big Table
This is probably the most common choice.
• Generate Parent = Y and Generate Children = N
• Requires Discriminator attribute (e.g. Account Type)
• Violates third normal form (… nothing but the key …)
• PRO: Easy to code against, just one big table
• CON: All child columns optional, so a table check constraint is needed (see the sketch below)
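Here is a minimal DDL sketch of Option #1, using the bank account example with invented
column names; the check constraint supplies the sub-type logic that the modeling tool
typically does not generate:

    -- Option #1: one big table with a discriminator column.
    CREATE TABLE account (
        account_id      NUMBER        NOT NULL,
        account_type    VARCHAR2(1)   NOT NULL,   -- discriminator: 'S' or 'C'
        open_date       DATE          NOT NULL,   -- common attribute
        interest_rate   NUMBER(5,2),              -- savings-only attribute
        overdraft_limit NUMBER(12,2),             -- checking-only attribute
        CONSTRAINT account_pk PRIMARY KEY (account_id),
        CONSTRAINT account_type_ck CHECK (
            (account_type = 'S' AND interest_rate IS NOT NULL
                                AND overdraft_limit IS NULL)
         OR (account_type = 'C' AND overdraft_limit IS NOT NULL
                                AND interest_rate IS NULL)
        )
    );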
Option #2 - Table per Sub-Type
This is another common and simple alternative.
• Results in N-1 tables
• Gen. Parent = N, Gen. Children = Y, Inherit All Attributes = Y
• PRO: All child columns implemented as expected
• CON: Two tables to code against (see the sketch below)
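And a minimal DDL sketch of Option #2 under the same assumed columns; the common
attributes are simply repeated in each sub-type table:

    -- Option #2: one table per sub-type, each inheriting the common attributes.
    CREATE TABLE savings_account (
        account_id    NUMBER      NOT NULL,
        open_date     DATE        NOT NULL,   -- common attribute, repeated
        interest_rate NUMBER(5,2) NOT NULL,
        CONSTRAINT savings_account_pk PRIMARY KEY (account_id)
    );

    CREATE TABLE checking_account (
        account_id      NUMBER       NOT NULL,
        open_date       DATE         NOT NULL, -- common attribute, repeated
        overdraft_limit NUMBER(12,2) NOT NULL,
        CONSTRAINT checking_account_pk PRIMARY KEY (account_id)
    );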
Option #3 - Table per Super and Sub-Type
This choice is valid – but has some issues related to complex application coding and
resulting performance (due to excessive joins).
• Results in N tables
• Gen. Parent = Y, Gen. Children = Y, Inherit Only Primary Attr. = Y
• NOT RECOMMENDED: Just Plain Overkill
Which should you use? That's up to you. But you do need to understand the three
alternatives – because there is not a one-size-fits-all answer that works every time.
Also note that not all data modeling tools support these choices the same way. Most data
modeling tools fail to create the complex constraints and triggers required by the first two
alternatives – so your DBA will have to compensate and add the missing database logic.
Conclusion
So when creating an initial physical data model for a specific target database platform
from a logical model of the business requirements, many issues remain that often require
close attention. Like word processors, data modeling tools don’t guarantee that resulting
databases will be effective or efficient. These tools merely attempt to automate the more
monotonous portions of the data modeling experience. So if we’re not well versed in data
modeling techniques and careful along the way, we could easily end up creating very bad
databases – even though we were good and used data modeling tools during the design.
Bert Scalzo is a Database Architect for Quest Software and a member of the TOAD team.
He has worked extensively with TOAD's developers and designed many of its features.
Mr. Scalzo has worked with Oracle databases for well over two decades, starting with
version 4. His work history includes time at Oracle Education and Oracle Consulting,
plus he holds several Oracle Masters certifications. Mr. Scalzo also has an extensive
academic background - including a BS, MS and PhD in Computer Science, an MBA and
several insurance industry designations. Mr. Scalzo is an accomplished speaker and has
presented at numerous Oracle conferences and user groups - including OOW, ODTUG,
IOUGA, OAUG, RMOUG, et al. His key areas of DBA interest are Data Modeling,
Database Benchmarking, Database Tuning & Optimization, "Star Schema" Data
Warehouses and Linux. Mr. Scalzo has written articles for Oracle’s Technology Network
(OTN), Oracle Magazine, Oracle Informant, PC Week (eWeek), Dell PowerEdge
Magazine, The Linux Journal, www.linux.com, and www.orafaq.com. He has also written
several books: "Oracle DBA Guide to Data Warehousing and Star Schemas", "TOAD
Handbook", "TOAD Pocket Reference" (2nd Edition), "Database Benchmarking:
Practical methods for Oracle 10g & SQL Server 2005", "Oracle on VMware: from
Laptops to Production" [2008], and "Oracle Utilities: The Definitive Reference" [2008].
Mr. Scalzo can be reached via email at [email protected] or
[email protected].