ArchIS: An Efficient Transaction-Time
Temporal Database System
Built on Relational Databases and XML
Fusheng Wang
University of California, Los Angeles
Motivation: Temporal Applications
Most database applications are temporal in nature:

Financial applications

Record-keeping applications

Scheduling applications

Scientific applications
Temporal Databases: the Reality

Over 40 temporal data models and query
languages have been proposed in the past

A long struggle to get around the limitations of
RDBMS

No DBMS vendors have moved aggressively to
extend SQL with temporal support
What’s Needed?
A temporal database system that provides:
 Expressive temporal representations and data
models with minimal or no extension

Powerful languages for temporal queries with
minimal or no extension

Indexing, clustering and query optimization
techniques for efficient query support

Architectures that bring these together
Outline

Motivation

Viewing Relation History in XML

Temporal Queries with XQuery

The ArchIS System

Performance Study

Database Compression

Conclusion
Background: Publishing Relational Databases as XML

Publishing relational DBs as XML:
- as actual XML documents: SQL/XML
- as XML views: SilkRoute, XPeranto
Viewing Relation History in XML

Our proposal: view the history of relational DBs as XML documents:
- Such history can be naturally represented in XML, without any extension to the data model
- Temporal queries can be expressed in XQuery as is, without any extension to the language
- Amenable to efficient implementation
Temporal Grouping in XML

Temporal data models can be classified as:

Temporally ungrouped

Temporally grouped

Temporally grouped data models have more
expressive power and are more natural for users

It is difficult to fit temporally grouped models into
RDBMS

Temporally grouped data model can be
represented well in XML
Example: Transaction-Time History of Tables
Timestamped tuple snapshots (temporally ungrouped)

name  empno  salary  title        deptno  DOB         start       end
Bob   10003  60000   Engineer     d01     1945-04-09  1995-01-01  1995-05-31
Bob   10003  70000   Engineer     d01     1945-04-09  1995-06-01  1995-09-30
Bob   10003  70000   Sr Engineer  d02     1945-04-09  1995-10-01  1996-01-31
Bob   10003  70000   Tech Leader  d02     1945-04-09  1996-02-01  1996-12-31
Temporally grouped history of employees

name    Bob          1995-01-01:1996-12-31
empno   10003        1995-01-01:1996-12-31
salary  60000        1995-01-01:1995-05-31
        70000        1995-06-01:1996-12-31
title   Engineer     1995-01-01:1995-09-30
        Sr Engineer  1995-10-01:1996-01-31
        Tech Leader  1996-02-01:1996-12-31
deptno  d01          1995-01-01:1995-09-30
        d02          1995-10-01:1996-12-31
DOB     1945-04-09   1995-01-01:1996-12-31
XML Representation of DB History
<employees tstart="1995-01-01" tend="1996-12-31">
<employee tstart="1995-01-01" tend="1996-12-31">
<empno tstart="1995-01-01" tend="1996-12-31">10003</empno>
<name tstart="1995-01-01" tend="1996-12-31">Bob</name>
<salary tstart="1995-01-01" tend="1995-05-31">60000</salary>
<salary tstart="1995-06-01" tend="1996-12-31">70000</salary>
<title tstart="1995-01-01" tend="1995-09-30">Engineer</title>
<title tstart="1995-10-01" tend="1996-01-31">Sr Engineer</title>
<title tstart="1996-02-01" tend="1996-12-31">Tech Leader</title>
<deptno tstart="1995-01-01" tend="1995-09-30">d01</deptno>
<deptno tstart="1995-10-01" tend="1996-12-31">d02</deptno>
<DOB tstart="1995-01-01" tend="1996-12-31">1945-04-09</DOB>
</employee>
<!-- … -->
</employees>
Advantages of XML Representations

The attribute value history is grouped, and can
be queried directly

The H-document has a well-defined schema
generated from the current table

The interval constraints are maintained in the
updates
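
The step from timestamped tuple snapshots to a temporally grouped history can be sketched in a few lines of Python. This is only an illustration using the Bob example from the slides, not the ArchIS implementation: the function name `group_history` is hypothetical, and the sketch assumes the snapshots of one employee are contiguous in time, so that consecutive snapshots with the same attribute value can simply be merged.

```python
# Timestamped tuple snapshots for Bob, as in the example table.
# ISO dates are kept as strings because they compare correctly that way.
SNAPSHOTS = [
    {"salary": "60000", "title": "Engineer", "deptno": "d01",
     "start": "1995-01-01", "end": "1995-05-31"},
    {"salary": "70000", "title": "Engineer", "deptno": "d01",
     "start": "1995-06-01", "end": "1995-09-30"},
    {"salary": "70000", "title": "Sr Engineer", "deptno": "d02",
     "start": "1995-10-01", "end": "1996-01-31"},
    {"salary": "70000", "title": "Tech Leader", "deptno": "d02",
     "start": "1996-02-01", "end": "1996-12-31"},
]


def group_history(rows, attributes):
    """Merge consecutive snapshots that share an attribute value.

    Assumption: rows describe one entity and leave no gaps in time.
    Returns {attribute: [(value, tstart, tend), ...]}.
    """
    rows = sorted(rows, key=lambda r: r["start"])
    history = {}
    for attr in attributes:
        periods = []
        for row in rows:
            if periods and periods[-1][0] == row[attr]:
                # Same value as the previous period: extend its end.
                value, start, _ = periods[-1]
                periods[-1] = (value, start, row["end"])
            else:
                periods.append((row[attr], row["start"], row["end"]))
        history[attr] = periods
    return history


HISTORY = group_history(SNAPSHOTS, ["salary", "title", "deptno"])
```

Each entry of `HISTORY` corresponds to one group of timestamped child elements in the H-document, for example the two `<salary>` elements for Bob.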
Outline

Motivation

Viewing Relation History in XML

Temporal Queries with XQuery

The ArchIS System

Performance Study

Database Compression

Conclusion
Temporal Queries with XQuery

XQuery: the coming standard query language
for XML

With XQuery, we can specify temporal queries without any extension:
- Temporal projection, snapshot queries, temporal joins, interval queries
- Complex queries: A SINCE B, continuous periods, period containment
Temporal Queries with XQuery

Temporal projection: retrieve the salary history of "Bob":
element salary_history {
  for $s in doc("employees.xml")/employees/employee[name="Bob"]/salary
  return $s }

Snapshot queries: retrieve the departments on 1996-01-31:
for $d in doc("depts.xml")/depts/dept
    [tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"]
let $n := $d/name[tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"]
let $m := $d/manager[tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"]
return ( element dept { $n, $m } )
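
To make the snapshot predicate concrete, here is a minimal Python sketch that applies the same test (tstart <= date and tend >= date) to the employee H-document shown earlier. It uses only the standard library; the helper names `valid_on` and `snapshot` are hypothetical, and the sketch reads the `tstart`/`tend` attributes directly instead of going through the built-in temporal functions.

```python
import xml.etree.ElementTree as ET

# The employee H-document from the slides (one employee shown).
H_DOC = """<employees tstart="1995-01-01" tend="1996-12-31">
<employee tstart="1995-01-01" tend="1996-12-31">
<empno tstart="1995-01-01" tend="1996-12-31">10003</empno>
<name tstart="1995-01-01" tend="1996-12-31">Bob</name>
<salary tstart="1995-01-01" tend="1995-05-31">60000</salary>
<salary tstart="1995-06-01" tend="1996-12-31">70000</salary>
<title tstart="1995-01-01" tend="1995-09-30">Engineer</title>
<title tstart="1995-10-01" tend="1996-01-31">Sr Engineer</title>
<title tstart="1996-02-01" tend="1996-12-31">Tech Leader</title>
<deptno tstart="1995-01-01" tend="1995-09-30">d01</deptno>
<deptno tstart="1995-10-01" tend="1996-12-31">d02</deptno>
<DOB tstart="1995-01-01" tend="1996-12-31">1945-04-09</DOB>
</employee>
</employees>"""


def valid_on(elem, day):
    """True if the element's [tstart, tend] interval contains `day`."""
    return elem.get("tstart") <= day <= elem.get("tend")


def snapshot(root, day):
    """Return one {attribute: value} dict per employee valid on `day`."""
    result = []
    for emp in root.findall("employee"):
        if valid_on(emp, day):
            result.append(
                {child.tag: child.text for child in emp if valid_on(child, day)}
            )
    return result


SNAPSHOT_1996_01_31 = snapshot(ET.fromstring(H_DOC), "1996-01-31")
```

For the date used in the slide, the only surviving title is the one whose interval ends exactly on 1996-01-31, which shows why both comparisons have to be inclusive.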
Temporal Functions
- Shield the user from the low-level details used in representing time, e.g., "now"
- Eliminate the need for the user to write complex functions, e.g., coalescing
- Predefined functions:
  - Restructuring: coalesce($l)
  - Period comparison: toverlaps, tprecedes, tcontains, tequals, tmeets
  - Duration and date/time: tstart($e), tend($e), timespan($e)
  - telement(Ts, Te): constructs an empty element timestamped as tstart=Ts, tend=Te
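
The slides name the period comparison functions without defining them, so the following Python sketch spells out one plausible reading. The semantics are assumptions: periods are closed intervals of days, given as `(start, end)` pairs of `datetime.date`, and two periods "meet" when the second starts the day after the first ends.

```python
from datetime import date, timedelta


def toverlaps(p, q):
    """The two periods share at least one day."""
    return p[0] <= q[1] and q[0] <= p[1]


def tprecedes(p, q):
    """Period p ends before period q starts."""
    return p[1] < q[0]


def tcontains(p, q):
    """Period p covers every day of period q."""
    return p[0] <= q[0] and q[1] <= p[1]


def tequals(p, q):
    """Both periods have the same start and end."""
    return p[0] == q[0] and p[1] == q[1]


def tmeets(p, q):
    """Period q starts on the day after period p ends (assumed semantics)."""
    return p[1] + timedelta(days=1) == q[0]


# Bob's two salary periods and the employee's whole lifespan.
FIRST = (date(1995, 1, 1), date(1995, 5, 31))
SECOND = (date(1995, 6, 1), date(1996, 12, 31))
WHOLE = (date(1995, 1, 1), date(1996, 12, 31))
```

With these definitions, `FIRST` meets and precedes `SECOND` without overlapping it, and `WHOLE` contains both.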
Support for ‘now’
'now': no change until now
- Internally, "end of time" values are used to denote 'now', e.g., 9999-12-31
- Intervals are only accessed through built-in functions: tstart() returns the start of an interval; tend() returns the end if it is different from 9999-12-31, and CURRENT_DATE otherwise

In the output, the tend value can be:
- "9999-12-31"
- CURRENT_DATE, by using rtend($e), which recursively replaces all occurrences of 9999-12-31 with the current date
- "now", by using externalnow($e), which recursively replaces all occurrences of "9999-12-31" with the string "now"
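
The handling of 'now' can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the functions work on ISO date strings and serialized XML text rather than on XQuery element nodes, and the optional `today` parameter is added here only to make the behaviour reproducible.

```python
from datetime import date

END_OF_TIME = "9999-12-31"  # internal encoding of 'now'


def tend(stored_end, today=None):
    """Return the stored end, or the current date if it encodes 'now'."""
    if today is None:
        today = date.today().isoformat()
    return today if stored_end == END_OF_TIME else stored_end


def rtend(xml_text, today=None):
    """Replace every end-of-time value in the output with the current date."""
    if today is None:
        today = date.today().isoformat()
    return xml_text.replace(END_OF_TIME, today)


def externalnow(xml_text):
    """Replace every end-of-time value in the output with the string 'now'."""
    return xml_text.replace(END_OF_TIME, "now")


LIVE_SALARY = '<salary tstart="1995-06-01" tend="9999-12-31">70000</salary>'
```

Keeping the sentinel value internal means stored tuples never need to be rewritten as time passes; only the output functions decide how 'now' is shown.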
Outline

Motivation

Viewing Relation History in XML

Temporal Queries with XQuery

The ArchIS System

Performance Study

Database Compression

Conclusion
The ArchIS System

Two approaches are possible for storing and querying H-documents (H-views):

Native XML database approach: store H-documents directly into an XML DB

XML-enabled RDBMS. Design issues include:
- mapping (shredding) the XML views representing the H-documents into tables (H-tables)
- translation of queries from the XML views to the H-tables
- indexing, clustering and query mapping techniques
ArchIS: Archival Information System
The ArchIS System: Architecture
[Figure: ArchIS architecture. Labels: Current Database (Relational Data, SQL Queries); Active Rules / update logs; ArchIS (H-tables); Temporal XML Data as H-views (H-documents); Temporal XML Queries.]
H-tables

Assumptions
- Each entity or relation has a unique key (or composite key) that identifies it and does not change over its history, e.g., employee: empno

H-tables:
- attribute history table: stores the history of each attribute
- key table: built for the key
- global relation table: records the history of relations

e.g., current database:
- employee(empno, name, sex, DOB, deptno, salary, title)
H-tables (cont’d)

Current table: employee

H-tables:
global relation table:    relations(relationname, tstart, tend)
key table (empno):        employee_id(id, tstart, tend)
attribute history tables:
  name:    employee_name(id, name, tstart, tend)
  …
  salary:  employee_salary(id, salary, tstart, tend)
  title:   employee_title(id, title, tstart, tend)
H-tables (cont’d)

Sample contents of employee_salary:
ID      SALARY  TSTART      TEND
======  ======  ==========  ==========
100022  58805   02/04/1985  02/04/1986
100022  61118   02/05/1986  02/04/1987
100022  65103   02/05/1987  02/04/1988
100022  64712   02/05/1988  02/03/1989
100022  65245   02/04/1989  02/03/1990
100023  43162   07/13/1988  07/13/1989
...
Updating Table Histories

Changes in the current database can be tracked with either update logs or triggers
- DB2: triggers
- ArchIS: update logs
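
Whichever mechanism detects the change, the effect on an attribute history table is the same: the live tuple is closed and a new live tuple is opened. The Python sketch below shows this for `employee_salary`, using SQLite purely as a stand-in relational engine. The helper names are hypothetical, dates are stored as ISO strings (the slides show mm/dd/yyyy), and closing the old tuple on the day before the change is an assumption consistent with the sample data.

```python
import sqlite3
from datetime import date, timedelta

END_OF_TIME = "9999-12-31"  # internal encoding of 'now'


def create_salary_history(conn):
    """Create the attribute history table for salary."""
    conn.execute(
        "CREATE TABLE employee_salary "
        "(id INTEGER, salary INTEGER, tstart TEXT, tend TEXT)"
    )


def apply_salary_update(conn, emp_id, new_salary, on_date):
    """Archive a salary change that takes effect on `on_date`."""
    day_before = (date.fromisoformat(on_date) - timedelta(days=1)).isoformat()
    # Close the tuple that was live until now.
    conn.execute(
        "UPDATE employee_salary SET tend = ? WHERE id = ? AND tend = ?",
        (day_before, emp_id, END_OF_TIME),
    )
    # Open the new live tuple.
    conn.execute(
        "INSERT INTO employee_salary VALUES (?, ?, ?, ?)",
        (emp_id, new_salary, on_date, END_OF_TIME),
    )


def demo_salary_rows():
    """Replay Bob's raise from the running example."""
    conn = sqlite3.connect(":memory:")
    create_salary_history(conn)
    conn.execute(
        "INSERT INTO employee_salary VALUES (10003, 60000, '1995-01-01', ?)",
        (END_OF_TIME,),
    )
    apply_salary_update(conn, 10003, 70000, "1995-06-01")
    return conn.execute(
        "SELECT id, salary, tstart, tend FROM employee_salary ORDER BY tstart"
    ).fetchall()
```

Calling `demo_salary_rows()` returns the closed 60000 tuple followed by the live 70000 tuple, which is the pattern that replaying an update log or firing a trigger would have to produce.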
Query Mapping

General purpose query mapping: XPeranto

In ArchIS, we have a well-defined mapping between H-documents (or H-views) and H-tables

We map temporal XQuery queries into SQL, utilizing SQL/XML
- SQL/XML is a new standard to map between RDBMS and XML
- Both tag binding and structure construction are pushed inside the relational engine, which makes the mapping very efficient
SQL/XML Publishing Functions

XMLElement and XMLAttributes
select XMLElement (Name "dept",
    XMLAttributes (tstart as "tstart", tend as "tend"),
    deptname)
from dept where deptname = 'Sales'

Result:
<dept tstart="02/04/1985" tend="12/31/9999">Sales</dept>

XMLAgg
select XMLElement (Name "new_employees",
    XMLAttributes ('02/04/2003' as "Since"),
    XMLAgg (XMLElement (Name "employee", e.name)))
from employee_name as e
where e.tstart >= '02/04/2003'

Result:
<new_employees Since="02/04/2003">
<employee>Bob</employee>
<employee>Jack</employee>
</new_employees>
XQuery Mapping to SQL with SQL/XML

Temporal projection: retrieve the salary history of "Bob":
element salary_history {
  for $s in doc("employees.xml")/employees/employee[name="Bob"]/salary
  return $s }

Equivalent SQL/XML query on the H-tables:
select
    XMLElement (Name "salary_history",
        XMLAgg (XMLElement (Name "salary",
            XMLAttributes (S.tstart as "tstart",
                           S.tend as "tend"), S.salary)))
from employee_salary as S, employee_name as N
where N.id = S.id and N.name = 'Bob'
group by N.id
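
The shape of this translated query can be tried out without an SQL/XML-capable engine. The Python sketch below runs the same join on SQLite and builds the XML in application code, which is exactly the step SQL/XML pushes into the relational engine, so it illustrates the mapping rather than ArchIS itself. The demo data, ISO date strings, and function names are assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import groupby


def demo_connection():
    """Build a tiny in-memory copy of two H-tables for Bob."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee_name (id, name, tstart, tend)")
    conn.execute("CREATE TABLE employee_salary (id, salary, tstart, tend)")
    conn.execute(
        "INSERT INTO employee_name VALUES "
        "(10003, 'Bob', '1995-01-01', '1996-12-31')"
    )
    conn.executemany(
        "INSERT INTO employee_salary VALUES (?, ?, ?, ?)",
        [
            (10003, 60000, "1995-01-01", "1995-05-31"),
            (10003, 70000, "1995-06-01", "1996-12-31"),
        ],
    )
    return conn


def salary_history(conn, name):
    """Return one <salary_history> document per matching employee id."""
    rows = conn.execute(
        "SELECT N.id, S.salary, S.tstart, S.tend "
        "FROM employee_salary AS S, employee_name AS N "
        "WHERE N.id = S.id AND N.name = ? "
        "ORDER BY N.id, S.tstart",
        (name,),
    ).fetchall()
    documents = []
    # groupby plays the role of GROUP BY N.id; the inner loop that of XMLAgg.
    for _emp_id, group in groupby(rows, key=lambda r: r[0]):
        root = ET.Element("salary_history")
        for _id, salary, tstart, tend in group:
            elem = ET.SubElement(root, "salary", tstart=tstart, tend=tend)
            elem.text = str(salary)
        documents.append(ET.tostring(root, encoding="unicode"))
    return documents


BOB_HISTORY = salary_history(demo_connection(), "Bob")
```

`BOB_HISTORY` holds a single document with two timestamped `<salary>` elements, matching what the XQuery projection returns on the H-document.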
XQuery Mapping to SQL with SQL/XML: Steps

Identification of variable range
- Map variables in FOR/LET clauses to the underlying H-tables

Generation of join conditions
- There is a join condition between any pair of distinct tuple variables: join them by ids

Translation of built-in functions
- Map built-in temporal functions in XQuery into functions in ArchIS

Output generation
- Use XMLElement and XMLAgg constructs
Temporal Clustering and Indexing

Tuples in H-tables are stored in the order of
updates, thus neither temporally clustered nor
clustered by objects

Traditional indexes such as B+ Tree will not help
on snapshot queries, and better temporal
clustering is needed

For every segment, usefulness: U = Nlive/Nall
- At the beginning, U = 100%, and it decreases with updates
- The minimum tolerable usefulness: Umin
Segment-based Clustering Scheme
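[Figure: three consecutive segments (Segment 1, Segment 2, Segment 3), each covering an interval from segstart to segend and showing its live tuples against all tuples. A tuple belongs to segment SEG when tstart(tuple) <= segend(SEG) and tend(tuple) >= segstart(SEG).]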
Segment-based Clustering Scheme

Initially all tuples for an attribute history table are
archived in a live segment SEGlive with usefulness
U =100%. With updates, when U drops below Umin:
1. A new segment is allocated;
2. The interval of this segment is recorded in the table
segment(segno, segstart, segend);
3. All tuples in SEGlive are copied into a new segment
Si sorted by id;
4. All live tuples in SEGlive are copied into a new live
segment SEGlive', and the old live segment is dropped;
After that, the new segment SEGlive’ becomes the new
starting segment for updates
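
The archiving steps above can be sketched as a small in-memory Python model. This is an illustration under stated assumptions rather than the ArchIS code: the class and method names are hypothetical, tuples are `(id, value, tstart, tend)` with ISO date strings, an update closes the old tuple on the day before the change, and a new live segment starts the day after the archived one ends (as in the sample segments).

```python
from datetime import date, timedelta

END_OF_TIME = "9999-12-31"  # internal encoding of 'now'


def shift(day, days):
    """Return the ISO date `days` days after `day`."""
    return (date.fromisoformat(day) + timedelta(days=days)).isoformat()


class SegmentedHistory:
    """Hypothetical in-memory model of one attribute history table."""

    def __init__(self, u_min, start):
        self.u_min = u_min
        self.live = []           # SEGlive: tuples (id, value, tstart, tend)
        self.live_start = start  # start of the live segment's interval
        self.segments = []       # archived: (segstart, segend, tuples)

    def usefulness(self):
        """U = Nlive / Nall for the live segment."""
        if not self.live:
            return 1.0
        n_live = sum(1 for t in self.live if t[3] == END_OF_TIME)
        return n_live / len(self.live)

    def update(self, key, value, day):
        """Close the live tuple for `key`, add the new one, archive if needed."""
        closed = shift(day, -1)
        self.live = [
            (k, v, s, closed) if k == key and e == END_OF_TIME else (k, v, s, e)
            for (k, v, s, e) in self.live
        ]
        self.live.append((key, value, day, END_OF_TIME))
        if self.usefulness() < self.u_min:
            self.archive(day)

    def archive(self, day):
        # Steps 1-3: record the interval and copy all tuples, sorted by id.
        archived = sorted(self.live, key=lambda t: t[0])
        self.segments.append((self.live_start, day, archived))
        # Step 4: only live tuples carry over into the new live segment.
        self.live = [t for t in self.live if t[3] == END_OF_TIME]
        self.live_start = shift(day, 1)

    def snapshot(self, day):
        """Answer a snapshot query from the single segment covering `day`."""
        source = self.live
        for segstart, segend, tuples in self.segments:
            if segstart <= day <= segend:
                source = tuples
                break
        return [(k, v) for (k, v, s, e) in source if s <= day <= e]
```

Because live tuples are copied into the next segment, each segment is self-contained for its interval: a snapshot query reads exactly one segment, at the cost of some redundant tuples whose number is controlled by `u_min`.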
Segment-based Clustering Scheme (cont’d)

Sample segments:
Segment1 (01/01/1985 - 10/17/1991):
ID      SALARY  TSTART      TEND
100002  40000   02/20/1988  02/19/1989
100002  42010   02/20/1989  02/19/1990
100002  42525   02/20/1990  02/19/1991
100002  42727   02/20/1991  12/31/9999
...

Segment2 (10/18/1991 - 07/08/1995):
ID      SALARY  TSTART      TEND
100002  42727   02/20/1991  02/19/1992
100002  45237   02/20/1992  02/18/1993
100002  46465   02/19/1993  02/18/1994
100002  47418   02/19/1994  02/18/1995
100002  47273   02/19/1995  12/31/9999
...
Advantages of
Segment-based Clustering Scheme

The current live segment always has a high
usefulness, assuring efficient updates;

Records are globally temporally clustered on
segments;

For snapshot queries, only one segment is used;
for interval queries, only segments involved are
used;

Flexibility to control the number of redundant
tuples in segments with Umin
Storage Usage of Segment-based Clustering
[Figure: relative storage size with different Umin. x-axis: Umin (0.0 to 0.4); y-axis: storage size (1.0 to 1.7). Legend: Testing Data (Umin=0); 1/(1-Umin).]

Storage bound: Nseg <= N0/(1-Umin)
Query Performance on Temporal Data with
Segment-based Clustering
[Figure: query time in seconds (0 to 5) for Q1 to Q6, comparing ArchIS without segment-based clustering and ArchIS with segment-based clustering.]

Queries:
Point: Q1
Snapshot: Q2
Interval: Q5
History: Q3, Q4, Q6
Outline

Motivation

Viewing Relation History in XML

Temporal Queries with XQuery

The ArchIS System

Performance Study

Database Compression

Conclusion
Performance Study: Experimental Setup

Systems: Tamino, DB2, and ArchIS
- ArchIS uses BerkeleyDB as its storage manager, and builds a SQL query engine on top of it

Temporal data set: the history of 300,024 employees over 17 years
- The simulation models real-world salary increases, changes of titles, and changes of departments
- The size of the XML data is 334MB
- The single large XML document is cut into a collection of 15,000 small XML documents of around 25KB each

Machine: Pentium IV 2.4GHz PC with RedHat 8.0
Performance Study: Query Performance
[Figure: query time in seconds (axis ticks 0.1, 1, 10, 100) for Q1 to Q6 on DB2, ArchIS, and Tamino. DB2 and ArchIS: with clustering; Tamino: without clustering.]

Snapshot query Q2 on ArchIS is 137 times faster than on Tamino; interval query Q5 is 91 times faster; history query Q6 is 25 times faster, Q4 is 4 times faster, and Q3 is nearly 3 times faster.

Tamino with clustering: snapshot query Q2 is 3.3 times faster than without clustering (still 41 times slower than on ArchIS); interval query Q5 is 2.9 times faster than without clustering (still 31 times slower than on ArchIS); history queries are much slower.
Storage Utilization
[Figure: compression ratio (0.0 to 1.5) for DB2, ArchIS, and Tamino (with compression).]
Outline

Motivation

Viewing Relation History in XML

Temporal Queries with XQuery

The ArchIS System

Performance Study

Database Compression

Conclusion
Database Compression

The disparity between CPU/memory and disk speeds is becoming larger and larger
- Cost to read one IDE disk page: 14ms
- Cost to uncompress one page: 1.1ms (500MHz CPU), 0.26ms (2.4GHz CPU)
- Cost to retrieve one compressed page: 14ms + 0.26ms ≈ 14.3ms
- Cost to retrieve the equivalent uncompressed pages (3.6 pages): 14ms x 3.6 = 50.4ms
Page-based Compression: PageZIP

Traditional data compression tools: compress a
file as a whole

PageZIP: page-based compression and
uncompression at the granularity of a page

Based on gzip library: zlib

Benefit: save space; point, snapshot or interval
queries only retrieve a small fraction of the
history, and can be efficient
PageZIP
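[Figure: PageZIP layout. Each segment (Segment 1 … Segment n) is stored as a sequence of compressed pages, each covering an ID range, e.g., page 1: ID 1001 - 1100; page 2: ID 1100 - 1203; page 3: ID 1203 - 1331.]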
Storage Utilization with Compression
For each attribute history table, we compress it as a sequence of pages and store each page as a BLOB in an RDBMS:
employee_salary(sid, salary, tstart, tend) =>
employee_salary_blob(pageno, startsid, endsid, pageblob)

[Figure: compression ratio (0.0 to 1.5) without and with compression for Tamino, DB2, and ArchIS.]
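
The idea of page-granularity compression can be sketched with Python's standard `zlib` module, which wraps the same library the slides mention. This is not PageZIP itself: the functions `compress_pages` and `lookup` are hypothetical, pages hold a fixed number of rows rather than a fixed number of bytes, and rows are serialized as comma-separated text. The page fields mirror the `employee_salary_blob` layout (pageno is the list position).

```python
import zlib


def compress_pages(rows, rows_per_page):
    """Split rows (sorted by sid) into pages and compress each page alone."""
    pages = []
    for i in range(0, len(rows), rows_per_page):
        chunk = rows[i:i + rows_per_page]
        text = "\n".join(",".join(str(field) for field in row) for row in chunk)
        pages.append({
            "startsid": chunk[0][0],
            "endsid": chunk[-1][0],
            "pageblob": zlib.compress(text.encode("utf-8")),
        })
    return pages


def lookup(pages, sid):
    """Decompress only the pages whose sid range can contain `sid`."""
    result = []
    for page in pages:
        if page["startsid"] <= sid <= page["endsid"]:
            text = zlib.decompress(page["pageblob"]).decode("utf-8")
            for line in text.split("\n"):
                fields = line.split(",")
                if int(fields[0]) == sid:
                    result.append(tuple(fields))
    return result


# Ten illustrative rows: (sid, salary, tstart, tend).
DEMO_ROWS = [(sid, 40000 + sid, "1995-01-01", "9999-12-31") for sid in range(1, 11)]
DEMO_PAGES = compress_pages(DEMO_ROWS, rows_per_page=4)
```

Looking up one sid touches a single page here, which is the property that lets point, snapshot, and interval queries avoid decompressing the whole history.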
Query Performance with Compression
[Figure: query time in seconds (axis ticks 1, 10, 100) for Q1 to Q6, comparing DB2 without compression, DB2 with compression, Tamino, ArchIS without compression, and ArchIS with compression.]
Update Performance

For RDBMS, only the current segment is used for updates. For Tamino, current data and historical data are clustered together

Update an employee's salary:
- DB2: 0.29 seconds; Tamino: 1.2 seconds

Assume that every employee gets updated once a year: about 1/260 of the total employees get updated every day on average
- DB2: 1.52 seconds; Tamino: 15 seconds

In the worst case for segment-based archiving: 39 seconds for copying segments and 36 seconds for compression, but only once
Summary

We built a transaction-time temporal database on RDBMS and XML, with:
- XML to support temporally grouped (virtual) representations of the database history
- XQuery to express powerful temporal queries on such views
- temporal clustering for managing the actual historical data in an RDBMS
- SQL/XML for executing the queries on the XML views as equivalent queries on the relational DB
- compression as an option for efficient storage

ArchIS provides a unified solution for a wide
spectrum of temporal application problems
Future Work

Friendly temporal query interfaces based on
temporally grouped models

Other clustering and indexing techniques to be
investigated

Other efficient data compression techniques
proposed for XML data to be investigated

Apply the approach to valid-time DB and bitemporal DB

Apply the approach to OODBMS and semistructured data model