ArchIS: An Efficient Transaction-Time Temporal Database System Built on Relational Databases and XML
Fusheng Wang
University of California, Los Angeles

Motivation: Temporal Applications
- Most database applications are temporal in nature:
  - Financial applications
  - Record-keeping applications
  - Scheduling applications
  - Scientific applications

Temporal Databases: the Reality
- Over 40 temporal data models and query languages have been proposed in the past
- A long struggle to get around the limitations of RDBMS
- No DBMS vendors have moved aggressively to extend SQL with temporal support

What's Needed?
A temporal database system that provides:
- Expressive temporal representations and data models, with minimal or no extension
- Powerful languages for temporal queries, with minimal or no extension
- Indexing, clustering and query optimization techniques for efficient query support
- Architectures that bring these together

Outline
- Motivation
- Viewing Relation History in XML
- Temporal Queries with XQuery
- The ArchIS System
- Performance Study
- Database Compression
- Conclusion

Background: Publishing Relational Database as XML
- Publishing relational DBs as XML:
  - as actual XML documents: SQL/XML
  - as XML views: SilkRoute, XPeranto

Viewing Relation History in XML
- Our proposal: view the history of relational DBs as XML documents:
  - Such history can be naturally represented in XML, without any extension to the data model
  - Temporal queries can be expressed in XQuery as is, without any extension to the language
  - Amenable to efficient implementation

Temporal Grouping in XML
- Temporal data models can be classified as:
  - Temporally ungrouped
  - Temporally grouped
- Temporally grouped data models have more expressive power and are more natural for users
- It is difficult to fit temporally grouped models into RDBMS
- Temporally grouped data models can be represented well in XML

Example: Transaction-Time History of Tables
Timestamped tuple snapshots (temporally ungrouped):

    name  empno  salary  title        deptno  DOB         start       end
    Bob   10003  60000   Engineer     d01     1945-04-09  1995-01-01  1995-05-31
    Bob   10003  70000   Engineer     d01     1945-04-09  1995-06-01  1995-09-30
    Bob   10003  70000   Sr Engineer  d02     1945-04-09  1995-10-01  1996-01-31
    Bob   10003  70000   Tech Leader  d02     1945-04-09  1996-02-01  1996-12-31

Temporally grouped history of employees (one row per employee, each attribute holding its own timestamped values):

    name    Bob          1995-01-01:1996-12-31
    empno   10003        1995-01-01:1996-12-31
    salary  60000        1995-01-01:1995-05-31
            70000        1995-06-01:1996-12-31
    title   Engineer     1995-01-01:1995-09-30
            Sr Engineer  1995-10-01:1996-01-31
            Tech Leader  1996-02-01:1996-12-31
    deptno  d01          1995-01-01:1995-09-30
            d02          1995-10-01:1996-12-31
    DOB     1945-04-09   1995-01-01:1996-12-31

XML Representation of DB History

    <employees tstart="1995-01-01" tend="1996-12-31">
      <employee tstart="1995-01-01" tend="1996-12-31">
        <empno tstart="1995-01-01" tend="1996-12-31">10003</empno>
        <name tstart="1995-01-01" tend="1996-12-31">Bob</name>
        <salary tstart="1995-01-01" tend="1995-05-31">60000</salary>
        <salary tstart="1995-06-01" tend="1996-12-31">70000</salary>
        <title tstart="1995-01-01" tend="1995-09-30">Engineer</title>
        <title tstart="1995-10-01" tend="1996-01-31">Sr Engineer</title>
        <title tstart="1996-02-01" tend="1996-12-31">Tech Leader</title>
        <deptno tstart="1995-01-01" tend="1995-09-30">d01</deptno>
        <deptno tstart="1995-10-01" tend="1996-12-31">d02</deptno>
        <DOB tstart="1995-01-01" tend="1996-12-31">1945-04-09</DOB>
      </employee>
      <!-- … -->
    </employees>

Advantages of XML Representations
- The attribute value history is grouped, and can be queried directly
- The H-document has a well-defined schema generated from the current table
- The interval constraints are maintained in the updates

Temporal Queries with XQuery
- XQuery: the coming standard query language for XML
- With XQuery, we can specify temporal queries without any extension:
  - Temporal projection, snapshot queries, temporal joins, interval queries
  - Complex queries: A SINCE B, continuous periods, period containment

Temporal Queries with XQuery (cont'd)
- Temporal projection: retrieve the salary history of "Bob":

    element salary_history {
      for $s in doc("employees.xml")/employees/employee[name="Bob"]/salary
      return $s
    }

- Snapshot queries: retrieve the departments on 1996-01-31:

    for $d in doc("depts.xml")/depts/dept
                [tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"]
    let $n := $d/name[tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"]
    let $m := $d/manager[tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"]
    return ( element dept { $n, $m } )

Temporal Functions
- Shield the user from the low-level details used in representing time, e.g., "now"
- Eliminate the need for the user to write complex functions, e.g., coalescing
- Predefined functions:
  - Restructuring: coalesce($l)
  - Period comparison: toverlaps, tprecedes, tcontains, tequals, tmeets
  - Duration and date/time: tstart($e), tend($e), timespan($e)
  - telement(Ts, Te): constructs an empty element timestamped with tstart=Ts, tend=Te

Support for 'now'
- 'now': no change until now
- Internally, "end of time" values are used to denote 'now', e.g., 9999-12-31
- Intervals are only accessed through built-in functions:
  - tstart() returns the start of an interval
  - tend() returns the end of an interval if it differs from 9999-12-31, and CURRENT_DATE otherwise
- In the output, the tend value can be:
  - "9999-12-31"
  - CURRENT_DATE, by using rtend($e), which recursively replaces all occurrences of 9999-12-31 with the current date
  - "now", by using externalnow($e), which recursively replaces all occurrences of "9999-12-31" with the string "now"

The ArchIS System
- Two approaches are possible for storing and querying H-documents (H-views):
  - Native XML database approach: store H-documents directly into XML DB
  - XML-enabled RDBMS.
Design issues include:
  - mapping (shredding) the XML views representing the H-documents into tables (H-tables)
  - translation of queries from the XML views to the H-tables
  - indexing, clustering and query mapping techniques
- ArchIS: Archival Information System

The ArchIS System: Architecture
[Figure: architecture diagram. Components shown: current database (relational data, SQL queries); active rules / update logs; H-tables; H-views (H-documents) over temporal XML data; temporal XML queries; ArchIS.]

H-tables
- Assumptions: each entity or relation has a unique key (or composite key) that identifies it and does not change over the history, e.g., employee: empno
- H-tables:
  - key table: built for the key
  - attribute history table: stores the history of each attribute
  - global relation table: records the history of relations
- e.g., current database: employee(empno, name, sex, DOB, deptno, salary, title)

H-tables (cont'd)
Mapping from the current table employee to its H-tables:

    Source              H-table kind               H-table
    (all relations)     global relation table      relations(relationname, tstart, tend)
    empno               key table                  employee_id(id, tstart, tend)
    name                attribute history table    employee_name(id, name, tstart, tend)
    …                   …                          …
    salary              attribute history table    employee_salary(id, salary, tstart, tend)
    title               attribute history table    employee_title(id, title, tstart, tend)

H-tables (cont'd)
Sample contents of employee_salary:

    ID      SALARY  TSTART      TEND
    ======  ======  ==========  ==========
    100022  58805   02/04/1985  02/04/1986
    100022  61118   02/05/1986  02/04/1987
    100022  65103   02/05/1987  02/04/1988
    100022  64712   02/05/1988  02/03/1989
    100022  65245   02/04/1989  02/03/1990
    100023  43162   07/13/1988  07/13/1989
    ...

Updating Table Histories
- Changes in the current database can be tracked with either update logs or triggers
  - DB2: triggers
  - ArchIS: update logs

Query Mapping
- General-purpose query mapping: XPeranto
- In ArchIS, we have a well-defined mapping between H-documents (or H-views) and H-tables
- We map temporal XQuery queries into SQL, utilizing SQL/XML
  - SQL/XML is a new standard to map between RDBMS and XML
  - Both tag binding and structure construction are pushed inside the relational engine, which makes the mapping very efficient

SQL/XML Publishing Functions
- XMLElement and XMLAttributes:

    select XMLElement(Name "dept",
                      XMLAttributes(tstart as "tstart", tend as "tend"),
                      deptname)
    from dept
    where deptname = 'Sales'

  Result:

    <dept tstart="02/04/1985" tend="12/31/9999">Sales</dept>

- XMLAgg:

    select XMLElement(Name "new_employees",
                      XMLAttributes('02/04/2003' as "Since"),
                      XMLAgg(XMLElement(Name "employee", e.name)))
    from employee_name as e
    where e.tstart >= '02/04/2003'

  Result:

    <new_employees Since="02/04/2003">
      <employee>Bob</employee>
      <employee>Jack</employee>
    </new_employees>

XQuery Mapping to SQL with SQL/XML
- Temporal projection: retrieve the salary history of "Bob":

    element salary_history {
      for $s in doc("employees.xml")/employees/employee[name="Bob"]/salary
      return $s
    }

  is mapped to:

    select XMLElement(Name "salary_history",
                      XMLAgg(XMLElement(Name "salary",
                                        XMLAttributes(S.tstart as "tstart", S.tend as "tend"),
                                        S.salary)))
    from employee_salary as S, employee_name as N
    where N.id = S.id and N.name = 'Bob'
    group by N.id

XQuery Mapping to SQL with SQL/XML: Steps
- Identification of variable range: map variables in the FOR/LET clauses into the underlying H-tables
- Generation of join conditions: there is a join condition for any pair of distinct tuple variables; join them by their ids
- Translation of built-in functions: map built-in temporal functions in XQuery into functions in ArchIS
- Output generation: use XMLElement and XMLAgg constructs

Temporal Clustering and Indexing
- Tuples in H-tables are stored in the order of updates, and are thus neither temporally clustered nor clustered by objects
- Traditional indexes such as the B+ tree will not help on snapshot queries; better temporal clustering is needed
- For every segment, usefulness: U = Nlive/Nall
- At the beginning, U = 100%, and it decreases with updates
- The minimum tolerable usefulness: Umin

Segment-based Clustering Scheme
[Figure: timeline of Segment 1, Segment 2 and Segment 3, each bounded by its own segstart and segend. A tuple is stored in segment SEG when tstart(tuple) <= segend(SEG) and tend(tuple) >= segstart(SEG).]

Segment-based Clustering Scheme
- Initially, all tuples for an attribute history table are archived in a live segment SEGlive with usefulness U = 100%.
- With updates, when U drops below Umin:
  1. A new segment is allocated;
  2. The interval of this segment is recorded in the table segment(segno, segstart, segend);
  3. All tuples in SEGlive are copied into a new segment Si, sorted by id;
  4. All live tuples in SEGlive are copied into a new live segment SEGlive', and the old live segment is dropped.
- After that, the new segment SEGlive' becomes the starting segment for updates

Segment-based Clustering Scheme (cont'd)
Sample segments:

Segment1 (01/01/1985 - 10/17/1991):

    ID      SALARY  TSTART      TEND
    100002  40000   02/20/1988  02/19/1989
    100002  42010   02/20/1989  02/19/1990
    100002  42525   02/20/1990  02/19/1991
    100002  42727   02/20/1991  12/31/9999
    ...

Segment2 (10/18/1991 - 07/08/1995):

    ID      SALARY  TSTART      TEND
    100002  42727   02/20/1991  02/19/1992
    100002  45237   02/20/1992  02/18/1993
    100002  46465   02/19/1993  02/18/1994
    100002  47418   02/19/1994  02/18/1995
    100002  47273   02/19/1995  12/31/9999
    ...
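
The segment-based clustering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ArchIS implementation: the class name SegmentedHistory, its methods, and the in-memory lists standing in for H-table storage are all hypothetical, and closing a tuple one day before the new version starts is inferred from the sample tables.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta
from typing import List, Tuple

END_OF_TIME = date(9999, 12, 31)  # internal encoding of 'now'


@dataclass
class Row:
    id: int
    value: int
    tstart: date
    tend: date = END_OF_TIME


class SegmentedHistory:
    """One attribute history table, clustered into segments (hypothetical API)."""

    def __init__(self, u_min: float, first_day: date):
        self.u_min = u_min
        self.live: List[Row] = []      # SEGlive
        self.live_start = first_day    # segstart of the live segment
        # Archived segments as (segstart, segend, rows sorted by id).
        self.archived: List[Tuple[date, date, List[Row]]] = []

    def usefulness(self) -> float:
        """U = Nlive / Nall, computed over the live segment."""
        if not self.live:
            return 1.0
        n_live = sum(1 for r in self.live if r.tend == END_OF_TIME)
        return n_live / len(self.live)

    def update(self, id: int, value: int, day: date) -> None:
        """Close the current version of `id` (if any) and append the new one."""
        for r in self.live:
            if r.id == id and r.tend == END_OF_TIME:
                r.tend = day - timedelta(days=1)
        self.live.append(Row(id, value, day))
        if self.usefulness() < self.u_min:
            self._archive(day)

    def _archive(self, day: date) -> None:
        # Steps 1-3: allocate a segment, record its interval, and copy all
        # tuples of SEGlive into it, sorted by id (stable, so time order holds).
        copied = sorted((replace(r) for r in self.live), key=lambda r: r.id)
        self.archived.append((self.live_start, day, copied))
        # Step 4: only the live tuples carry over into the new live segment.
        self.live = [r for r in self.live if r.tend == END_OF_TIME]
        self.live_start = day + timedelta(days=1)

    def snapshot(self, day: date) -> List[Row]:
        """Answer a snapshot query by scanning exactly one segment."""
        rows = self.live
        for segstart, segend, seg_rows in self.archived:
            if segstart <= day <= segend:
                rows = seg_rows
                break
        return [r for r in rows if r.tstart <= day <= r.tend]
```

With u_min=0.5, for example, the live segment is archived as soon as fewer than half of its tuples are still current. As in the sample segments, a tuple that is live at archiving time appears in the archived segment with tend 12/31/9999 and again in the next segment with its original tstart.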
Advantages of Segment-based Clustering Scheme
- The current live segment always has a high usefulness, assuring efficient updates
- Records are globally temporally clustered on segments
- For snapshot queries, only one segment is used; for interval queries, only the segments involved are used
- Umin gives flexible control over the number of redundant tuples in segments

Storage Usage of Segment-based Clustering
- Storage size bound: Nseg <= N0/(1-Umin)
[Figure: relative storage size for different Umin (x-axis: Umin, 0.0 to 0.4; y-axis: storage size, 1.0 to 1.7). Legend: 1/(1-Umin); Testing Data (Umin=0).]

Query Performance on Temporal Data with Segment-based Clustering
- Queries:
  - Point: Q1
  - Snapshot: Q2
  - Interval: Q5
  - History: Q3, Q4, Q6
[Figure: query time in seconds (0 to 5) for Q1-Q6, ArchIS without segment-based clustering vs. ArchIS with segment-based clustering.]

Performance Study: Experimental Setup
- Systems: Tamino, DB2, and ArchIS
  - ArchIS uses BerkeleyDB as its storage manager and builds a SQL query engine on top of it
- Temporal data set: the history of 300,024 employees over 17 years
  - The simulation models real-world salary increases, changes of titles, and changes of departments
  - The size of the XML data is 334MB
  - The single large XML document is cut into a collection of 15,000 small XML documents of around 25KB each
- Machine: Pentium IV 2.4GHz PC with RedHat 8.0

Performance Study: Query Performance
[Figure: query time in seconds (log scale, 0.1 to 100) for Q1-Q6 on DB2, ArchIS and Tamino. DB2 and ArchIS: with clustering; Tamino: without clustering.]
- Snapshot query Q2 on ArchIS is 137 times faster than on Tamino; interval query Q5 is 91 times faster; history query Q6 is 25 times faster, Q4 is 4 times faster, and Q3 is nearly 3 times faster.
- Tamino with clustering: snapshot query Q2 is 3.3 times faster than without clustering (still 41 times slower than on ArchIS); interval query Q5 is 2.9 times faster than without clustering (still 31 times slower than on ArchIS); history queries are much slower

Storage Utilization
[Figure: compression ratio (0.0 to 1.5) for DB2, ArchIS, and Tamino (with compression).]

Database Compression
- The disparity between CPU/memory and disk speeds is becoming larger and larger
  - Cost to read one IDE disk page: 14ms
  - Cost to uncompress one page: 1.1ms (500MHz CPU), 0.26ms (2.4GHz CPU)
- Cost to retrieve one compressed page: 14ms + 0.26ms ≈ 14.3ms
- Cost to retrieve the same data uncompressed (3.6 pages): 14ms x 3.6 = 50.4ms

Page-based Compression: PageZIP
- Traditional data compression tools compress a file as a whole
- PageZIP: page-based compression and uncompression, at the granularity of a page
- Based on the gzip library: zlib
- Benefits: saves space; point, snapshot or interval queries retrieve only a small fraction of the history, and can be efficient

PageZIP
[Figure: each of Segment 1 … Segment n is stored as a sequence of compressed pages (page 1, page 2, page 3, …), each covering an ID range, e.g., ID: 1001 - 1100, ID: 1100 - 1203, ID: 1203 - 1331.]

Storage Utilization with Compression
- We compress each attribute history table as a sequence of pages and store each page as a BLOB in an RDBMS:
  employee_salary(sid, salary, tstart, tend) => employee_salary_blob(pageno, startsid, endsid, pageblob)
[Figure: compression ratio (0.0 to 1.5) for Tamino, DB2 and ArchIS, without compression vs. with compression.]

Query Performance with Compression
[Figure: query time in seconds (log scale, 1 to 100) for Q1-Q6 on DB2 without compression, DB2 with compression, Tamino, ArchIS without compression, and ArchIS with compression.]

Update Performance
- For RDBMS, only the current segment is used for updates.
- For Tamino, current data and historical data are clustered together
- Update an employee's salary:
  - DB2: 0.29 seconds; Tamino: 1.2 seconds
- Assume that every employee gets updated once a year: on average, about 1/260 of all employees get updated every day
  - DB2: 1.52 seconds; Tamino: 15 seconds
- In the worst case for segment-based archiving: 39 seconds for copying segments and 36 seconds for compression, but only once

Summary
- We built a transaction-time temporal database on RDBMS and XML, with:
  - XML to support temporally grouped (virtual) representations of the database history
  - XQuery to express powerful temporal queries on such views
  - temporal clustering for managing the actual historical data in an RDBMS
  - SQL/XML for executing the queries on the XML views as equivalent queries on the relational DB
  - compression as an option for efficient storage
- ArchIS provides a unified solution for a wide spectrum of temporal application problems

Future Work
- Friendly temporal query interfaces based on temporally grouped models
- Other clustering and indexing techniques to be investigated
- Other efficient data compression techniques proposed for XML data to be investigated
- Apply the approach to valid-time and bitemporal DBs
- Apply the approach to OODBMS and the semistructured data model
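
As a supplementary illustration of the page-based compression idea from the PageZIP slides, the following Python sketch compresses an attribute history table page by page with zlib and looks up one id by uncompressing only the pages whose id range covers it. It is a minimal sketch under assumptions, not the actual PageZIP code: the function names build_pages and lookup are hypothetical, JSON is an arbitrary stand-in for the page format, and a page holds a fixed number of rows rather than a fixed number of bytes.

```python
import json
import zlib
from typing import List, Tuple

Row = Tuple[int, int, str, str]      # (sid, salary, tstart, tend)
Page = Tuple[int, int, int, bytes]   # (pageno, startsid, endsid, pageblob)


def build_pages(rows: List[Row], rows_per_page: int) -> List[Page]:
    """Sort rows by sid and compress each fixed-size chunk independently."""
    ordered = sorted(rows)
    pages: List[Page] = []
    for pageno, i in enumerate(range(0, len(ordered), rows_per_page)):
        chunk = ordered[i:i + rows_per_page]
        blob = zlib.compress(json.dumps(chunk).encode("utf-8"))
        # startsid/endsid play the role of the columns in employee_salary_blob.
        pages.append((pageno, chunk[0][0], chunk[-1][0], blob))
    return pages


def lookup(pages: List[Page], sid: int) -> List[Row]:
    """Uncompress only the pages whose [startsid, endsid] range contains sid."""
    found: List[Row] = []
    for _pageno, startsid, endsid, blob in pages:
        if startsid <= sid <= endsid:
            for row in json.loads(zlib.decompress(blob).decode("utf-8")):
                if row[0] == sid:
                    found.append(tuple(row))
    return found
```

Because the rows of one id can straddle a page boundary, adjacent pages may share a boundary id (as in the ID ranges 1001 - 1100 and 1100 - 1203 on the slide), so lookup checks every page whose range contains the id rather than stopping at the first match.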