Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Database Decay and What To Do About It by Michael Stonebraker (With Dong Deng and Michael Brodie) My Thesis •Traditional database design is all wrong • • Gold standard techniques are not used “in the wild” and for good reasons Contributes to database decay • DBMS application development is all wrong • Contributes to database decay • There are kluges to slow down decay • Which I will talk about •Best idea is a new paradigm for app development • Which I will talk about Context •Operational databases • But “rot” is upstream from analytic databases • And “flows downstream” Traditional Database Design Wisdom •Use a modeling tool (e.g. an EntityRelationship diagram tool) •To construct/modify an E-R diagram •When satisfied, push a button •Tool spits out a relational schema in 3rd normal form •Code in ODBC/JDBC against this schema Example Supplier (sno, 1 111 sregion) N Part (pno, psize, pcolor) Supply (qty) Two entities (Supplier and Part with attributes) One relationship (Supply with attributes, which is 1 – N) Lots of Elaboration •Weak entities •Inheritance •… •But the basics is all we need Algorithm to Produce 3rd Normal Form •Relation for each entity with attributes •For each 1 - N relationship, add the key of the 1 side to the table for the N side with its attributes •For each M – N relationship, add a table for the relationship with both keys and the attributes For Our Data Supplier (sno, sregion) Part (pno, psize, pcolor, sno, qty) •Application groups write applications for this relational schema, typically using ODBC/JDBC as the interface Large Applications • Usually designed/coded by separate application groups, reporting to different managers • • • Parts controlled by engineering Supply controlled by procurement Supplier controlled by finance • In aggregate have 3 applications • Often on different development schedules and budgets Now Suppose…. •Management decides to allow multiple suppliers for each part, as long as they are in different regions • Perhaps to get better terms … •Such changes happen frequently “in the wild” • Or • Often once a quarter or more New Diagram Supplier (sno, M 111 sregion) N Supply (qty) Part (pno, psize, pcolor) Two entities (Supplier and Part with attributes) One relationship (Supply with attributes which is now M – N) New Tables Supplier (sno, sregion) Part (pno, psize, pcolor) Supply (sno, pno, qty) Supplier table is unchanged “Old Part” table is a join of Part and Supply (defineable as a view based on ”new stuff”) Summary of The Traditional Wisdom •Convert “old” database to “new” •Define “old” as views •Applications (in theory) continue to run •Database stays in 3NF (defined as goodness by research community) A Dirty Little Secret •Based on a survey of about 20 DBAs at 3 large companies in the Boston area….. • • None use this methodology or Use it for “green field” design and then abandon it •I.e. the “gold standard” is not used “in the wild” •So why not?????? • Rest of the talk is a mix of speculation and ideas Problem #1: View Update Issues •Application: change the qty for part XXX to 500 • update to “old part” •But “old part” is a view which is a join • • • Updates to such views disallowed in most RDBMSs Either because of semantic ambiguity or vendor laziness Problem #2: Application Semantics May Change •Delete the fact that supplier XXX supplies part YYY • • • In “old” this deletes part XXX Or Turns fields in the Part table to “null” •In the new world where there are multiple suppliers of part XXX, it presumably only deletes a supply tuple Problem #3: Other Applications May Fail •And they are in different departments •With different budgets Net Result •Substantial risk! • • Applications all over the enterprise must be found and corrected Often with no budget for this activity Less Risky Solution -- Kluge •Leave the old schema as is Supplier (sno, sregion) Part (pno, psize, pcolor, sno, qty) •Replicate part info, when multiple suppliers supply the same part • • Result does not conform to any ER diagram! No longer 3NF Our Two Examples Do the Right Thing • No views to deal with • Deleting a tuple in the Part table “does the right thing” • Reasonable chance that other applications will continue to run correctly The Net-Net (Our Conjecture) • DBAs want to lower risk • Therefore DBAs try not to change the schema • Our change does not conform to any ER schema • ER diagram diverges from reality, if it ever existed • Hence, DBAs deal with table schemas, not ER schemas • Table schemas become increasingly distant from 3NF • We term this process “database decay” • Which ultimately means that a total rewrite is the only way forward So What to Do? • Higher-level interfaces • Kluges • • Defensive schemas Defensive applications • A new development paradigm Higher Level Interface • ORMs • Object model • E-R frameworks • E-R model • Solve neither of our problems! • And encapsulate the SQL, so it is inaccessible Defensive Schemas • Use the 3 table version, even though initially the two table one will work • • Anticipate changes and “build them in” And code against the larger schema initially • Leave some “dead fields” in the initial schema • For later use Defensive Schemas • Think “tall and skinny” • Single table • Data (entity-name, attribute-name, value) • Or lots of binary tables • Attribute (entity-key, value) Tall and Skinny • Generally horrible performance • But survives lots of changes • And integrity control must be built into application logic • Which may need to be changed Defensive Applications • Never use Select * From XXX • If a field you don’t use is dropped, then this code dies • Put in the target list only that stuff that you actually need Defensive Applications • Use Select Sum ( …) From XXX • Even if you know there is only one result • Later on there may be more than one, and this code will continue to work • Additional examples in the paper Best Idea – Change the Paradigm • ODBC is coded all over the enterprise • Where it is unmanageable • Application programmers must all understand the schema • • Which is a big challenge For large applications with 10’s to 100’s of tables • And they must understand defensive programming • Again a big challenge Old Way App 1 App 2 ODBC/JDBC App 3 ODBC/JDBC database New Way App 1 App 2 messages App 3 messages middleware database (DBA-controlled) Processing Model •Application programmers cannot write SQL! •Instead he/she negotiates with the DBA • • • He propose a task If DBA agrees, DBA writes the SQL And defines a messaging interface for the application programmer Processing Model • Think web services • Think stored procedures • Think execute task (parameters) • We will focus on a SQL interface – since we want to figure out what will continue to run when the schema changes DBA Effectively Has •A database of • (Task, SQL) pairs •There are lots of details (multiple commands, code in between the SQL commands, transactions, …) which we ignore And…. •He constructs the initial relational schema • Perhaps iteratively •However, think tables; not some other notation • Remember schema may decay! •App programmers implement business logic, but don’t need to know the schema •DBA converts messages into SQL Can We Help? •With the initial schema •With the replacement SQL • We will start here Suppose…. •Schema is considered a collection of tables with primary-key/foreign-key relationships • Think of this as a graph • Connected by PK/FK “join paths” • At scale 10’s to 100’s In General •The parameters of each message specify locations in this graph •And we require the “shape of the stored procedure” • 1 input or N inputs • 1 output or N outputs Example •Task: Order Part XXX from Supplier YYY (generates an order of size qty) -- shape is one part and one supplier •Graph: Supplier Supply Part Example •Straightforward to (mostly automatically) generate the SQL from the message •Because there is only a single join path Supplier Supply Part When Business Conditions Change… •Change the schema • It is the “right” thing to do! •And map the SQL Example •Task: Order Part XXX from Supplier YYY (generates an order of qty) •Graph: Supply Supplier Part For Each SQL Command: •Have the old path •And the new path •And the old SQL For Each SQL Command: •Can classify the command as: • Can be run unchanged (path and shape are the same) • Path has changed, but there is a unique replacement with the same shape (new SQL can be generated automatically) • There is no new path or multiple new paths or shape has changed (human intervention required) Possible Strategies •Run the tool in advance of committing the schema change to see how extensive the changes are •Use the tool at run time to fix the SQL at 1st access st • Effectively “schema at 1 access” •Or anything in between We have a Prototype •Would love to find an “in the wild” situation •To try it on • Especially one with a lot of tables! •This is a plea for a use case!!!!