Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Relational algebra wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Ingres (database) wikipedia , lookup
Object-relational impedance mismatch wikipedia , lookup
Clusterpoint wikipedia , lookup
Relational model wikipedia , lookup
Part 19 Implementation Performance Efficiency Depends On: Physical data storage Use of indices Query optimization Compiled vs. interpreted execution Ability to predict database usage, communicate that prediction to the DBMS, and make use of that information Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 2 Physical Data Storage Flat files/tables can be inefficient PROBLEM: Access to an employee and all assigned tasks may require one physical disk access per task Possible Solutions: use a hybrid DBMS (hierarchical, network) (different physical and logical) store first normal form relations store in master/detail (JoinDef) form ("array" within relation) One relation per entity can be inefficient PROBLEM: 80-20 rule states that 80% of all retrievals will occur against 20% of the attributes (e.g. emergency contact) Possible Solutions: multiple relations per entity, cluster attributes separate access mechanism for each attribute Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 3 Table Organization Default organization in most systems is a "heap" non-sequential file, usually in order of input Organizations commonly available: heap cheap heapsort cheapsort hash chash btree cbtree isam cisam sorted non-sequential, duplicates, new records at end compressed heap sorted at modify, maintained as heap compressed heapsort random hash table, no duplicates compressed hash dynamic B-tree, no duplicates compressed btree static indexed-sequential, no duplicates compressed isam maintained as sorted sequential Ingres allows table organization to be established via: MODIFY emp TO chash UNIQUE ON name, WHERE FILLFACTOR = 50; Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 4 Use of Indices An index is used to speed up retrieval aid "associative" retrieval by rapidly mapping values to locations can answer existence questions without data access “no free lunch” principle - slows updates General index types Sequential - useful for range queries Direct - useful for list queries Specific index types sorted sequential direct relation B-tree hashing pointer chains bit maps Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 5 Index Creation 1. Create an index CREATE INDEX rateindex ON emp(rate) generates a table of the form rate pointer In Oracle, this new index table is sorted and can be used immediately. In Ingres, this index table is a “cheap” and is useless until modified. 2. In Ingres, organize the index MODIFY rateindex TO Btree on rate; 3. In Oracle, automatically get an index when you specify that a field is a PRIMARY KEY or UNIQUE 4. Use of the index is selected by query optimizer Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 6 Query Optimization REQUIRED in any mainframe / mini environment to get acceptable performance OPPORTUNITY to capitalize on the strengths of the relational model System free to decide which record-level operations are needed use information available to it data values history of access select from a wide variety of alternatives EXAMPLE: SELECT DISTINCT s.sname FROM s, sp WHERE s.s# = sp.s# AND sp.p# = 'p2'; (100 suppliers - 10,000 shipments - 50 shipments of p2) Method 1: Generate Cartesian product, then restrict Method 2: Restrict sp before join Method 2 is 1/2 of 1% the work of Method 1! Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 7 Query Optimization Process 1. Cast query into an internal representation SQL has many ways to state the same query usually cast as an abstract syntax tree 2. Convert query into a canonical form apply rules for restating query usually converted to conjunctive normal form p or (q and r) --> (p or q) and (p or r) 3. Choose candidate low-level procedures consider availability of indices, physical locations of data, and size of relations 4. Generate query plans and choose the cheapest can be done at compile time or run time estimate the number of disk accesses or use a rulebased system don't generate/consider all combinations Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 8 Index Selection Query optimizers generally try to use indices when available to speed retrieval. However, indices slow update speed and take disk space. This is further complicated by the fact that you can index combinations of columns in addition to single columns. So if you have a table with 12 columns, there are over 1.3 billion possible column combinations that can be indexed in this one table alone. In addition, each combination of columns can be indexed multiple times, using various index organizations. Below is a set of “rules of thumb” to use when setting up indices: 1. If you are doing almost exclusively data entry, forget about indices until you start doing queries. 2. Index your primary keys. (For example deptno in table dept, and the combination of ename, project_id, and tname in table task.) 3. Index any field(s) referenced as foreign keys. (For example mgr and deptno in table emp) 4. Index any fields used in a significant number of WHERE clauses, provided that no one value occurs in more than 20% of the rows. (For example, if there were lots of retrievals on the task table both by hours and by tname, hours would be better than tname.) Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 9 Mathematical Sidelight on Number of Possible Indices For 1 field, there is only 1 index possible. For 2 fields, a and b, you can index a, b, ab, ba, so 4 indices are possible. For 3 fields, a, b, and c, you can index a, b, c, ab, ba, ac, ca, bc, cb, abc, acb, bac, bca, cab, cba, so 15 indices are possible. In general, you can create n + n(n-1) + n(n-1)(n-2) + … + n! indices on n columns. This can be re-written as n!(1/(n-1)! + 1/(n-2)! + 1/(n-3)! + … + 1/(1!) + 1/(0!)), but the sum in the parentheses approaches e = 2.718281828… as n gets large, so the number of indices is approximately e(n!). The table below gives the exact values for all small values of n. # of columns Number of indices possible 1 1 2 4 3 15 4 64 5 325 6 1956 7 13699 8 109600 9 986409 10 9864100 11 108505113 12 1302061346 Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 10 Selection of Views Reports select report items, do entry-level derivations Frequent queries include commonly retrieved combinations, predefine common joins Logical groups of users apply security to views pre-select only items of interest to the group Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 11 Semantic Disintegrity EXAMPLE (Martin): Query: List all incidents that were reported on passenger Jones' voyage: passenger passenger# incident incident# passenger_name address incident_name voyage# details ... join voyage# ... This join is valid because every incident on passenger Jones' voyage is needed Query: List all projects that employee Jones works on: employee employee# employee_name project project# project_name address details department# join department# ... This join is not valid because employee Jones does not, in general, work on every project in the department Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 12 Read-Only Database Database is created all at one time from another file / database Batch reports are generated perhaps at database creation On-line queries are performed no updates, but maybe new tables Database is destroyed after limited lifetime from hours to a month Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 13 The Need for Multiple Databases In general, it is not reasonable to have only one copy of the database serve the multiple purposes of: On-line transaction processing and update Regular reporting Ad-hoc queries Application development Beta-testing new software Stress testing applications Training Rapid prototyping Therefore, it is reasonable to generate multiple databases, not all with identical content. Suggested databases to develop are: Production Copy - for on-line transactions/update Read-Only Copy - for reporting and ad-hoc queries Scaled-down Sample - for development and testing Exceptional Cases - for stress testing Tiny Sample - for training and prototyping Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 14 Applications of Read-Only Databases Downloads Historical data Extracts Freeze-Frame Copy for performance reasons External version of internal data Enhance capabilities of non-relational database Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 15 Reasons to Identify Read-Only Situations Security Only include what needs to be read No risk of altered data Access not delayed by updates Integrity No risk of update creating an anomaly Simplicity Can create tables that directly relate to reports Performance Can store data in “processed” form Altered design criteria Normalization is required to eliminate update anomalies - what updates? Copyright 1971-2002 Thomas P. Sturm Implementation Part 19, Page 16