* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Download 1995-01-01:1996-12-31
Concurrency control wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Relational algebra wikipedia , lookup
Clusterpoint wikipedia , lookup
ArchIS: An Efficient Transaction-Time
Temporal Database System
Built on Relational Databases and XML
Fusheng Wang
University of California, Los Angeles
Motivation: Temporal Applications
Most database applications are temporal in nature:
Financial applications
Record-keeping applications
Scheduling applications
Scientific applications
Temporal Databases: the Reality
Over 40 temporal data models and query
languages have been proposed in the past
A long struggle to get around the limitations of
RDBMS
No DBMS vendors have moved aggressively to
extend SQL with temporal support
What’s Needed?
A temporal database system that provides:
Expressive temporal representations and data
models with minimal or no extension
Powerful languages for temporal queries with
minimal or no extension
Indexing, clustering and query optimization
techniques for efficient query support
Architectures that bring these together
Outline
Motivation
Viewing Relation History in XML
Temporal Queries with XQuery
The ArchIS System
Performance Study
Database Compression
Conclusion
Background:
Publishing Relational Database as XML
Publishing relational DBs as XML
as
actual XML documents: SQL/XML
as
XML views: SilkRoute, XPeranto
Viewing Relation History in XML
Our proposal: view the history of relational
DBs as XML documents:
Such
history can be naturally represented in
XML, without any extension to the data model
Temporal
queries can be expressed in
XQuery as is—without any extension to the
language
Amenable
for efficiently implementations
Temporal Grouping in XML
Temporal data models can be classified as:
Temporally ungrouped
Temporally grouped
Temporally grouped data models have more
expressive power and are more natural for users
It is difficult to fit temporally grouped models into
RDBMS
Temporally grouped data model can be
represented well in XML
Example: Transaction-Time History of Tables
Timestamped tuple snapshots (temporally ungrouped)
deptno
DOB
start
end
Engineer
d01
1945-04-09
1995-01-01
1995-05-31
70000
Engineer
d01
1945-04-09
1995-06-01
1995-09-30
10003
70000
Sr Engineer
d02
1945-04-09
1995-10-01
1996-01-31
10003
70000
Tech Leader
d02
1945-04-09
1996-02-01
1996-12-31
name
empno
salary
title
Bob
10003
60000
Bob
10003
Bob
Bob
Temporally grouped history of employees
name
empno
salary
60000
1995-01-01:1996-05-31
Bob
Engineer
1995-01-01:1995-09-30
deptno
70000
1995-10-01:1996-01-31
Tech Leader
1995-06-01:1996-12-31 1996-02-01:1996-12-31
DOB
d01
1995-01-01:1995-09-30
Sr Engineer d02
10003
1995-01-01:1996-12-31 1995-01-01:1996-12-31
title
1945-04-09
1995-10-01:1996-12-31 1995-01-01:1996-12-31
XML Representation of DB History
<employees tstart="1995-01-01" tend="1996-12-31">
<employee tstart="1995-01-01" tend="1996-12-31">
<empno tstart="1995-01-01" tend="1996-12-31">10003</empno>
<name tstart="1995-01-01" tend="1996-12-31">Bob</name>
<salary tstart="1995-01-01" tend="1995-05-31">60000</salary>
<salary tstart="1995-06-01" tend="1996-12-31">70000</salary>
<title tstart="1995-01-01" tend="1995-09-30">Engineer</title>
<title tstart="1995-10-01" tend="1996-01-31">Sr Engineer</title>
<title tstart="1996-02-01" tend="1996-12-31">Tech Leader</title>
<deptno tstart="1995-01-01" tend="1995-09-30">d01</deptno>
<deptno tstart="1995-10-01" tend="1996-12-31">d02</deptno>
<DOB tstart="1995-01-01" tend="1996-12-31">1945-04-09</DOB>
</employee>
<!-- … -->
</employees>
Advantages of XML Representations
The attribute value history is grouped, and can
be queried directly
The H-document has a well-defined schema
generated from the current table
The interval constraints are maintained in the
updates
Outline
Motivation
Viewing Relation History in XML
Temporal Queries with XQuery
The ArchIS System
Performance Study
Database Compression
Conclusion
Temporal Queries with XQuery
XQuery: the coming standard query language
for XML
With XQuery, we can specify temporal queries
without any extension:
Temporal
projection, snapshot queries, temporal joins,
interval queries
queries: A SINCE B, continuous periods,
period containment
Complex
Temporal Queries with XQuery
Temporal projection: retrieve the salary history of “Bob”:
element salary_history {
for $s in doc("employees.xml")/
employees/employee/[name=“Bob”]/salary
return $s }
Snapshot queries: retrieve the departments on 1996-01-31:
for $d in doc("depts.xml")/depts/dept
[tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"]
let $n := $d/name[tstart(.)<="1996-01-31" and tend(.)>="1996-01-31"]
let $m := $d/manager[tstart(.)<="1996-01-31" and tend(.)>=
"1996-01-31"]
return( element dept{$n,$m } )
Temporal Functions
Shield the user from the low-level details used in
representing time, e.g., “now”
Eliminate the need for the user to write complex
functions, e.g., coalescing
Predefined functions:
Restructuring: coalese($l)
Period
tmeets
comparison : toverlaps, tprecedes, tcontains, tequals,
Duration
and date/time: tstart($e), tend($e), timespan($e)
telement(Ts,
Te): constructs an empty element
element timestamped as tstart=Ts, tend=Te
Support for ‘now’
‘now’: no change until now
Internally, “end of time” values are used to denote
‘now’, e.g., 9999-12-31
Intervals are only accessed through built-in
functions: tstart() returns the start of an interval, tend() returns
the end or CURRENT_DATE if it’s different from 9999-12-31
In the output, tend value can be:
“9999-12-31”
CURRENT_DATE
by using rtend($e) that recursively
replaces all the occurrence of 9999-12-31 with the
current date,
“now”,
using externalnow($e) that recursively replaces
all the occurrence of \9999-12-31" with the string \now".
Outline
Motivation
Viewing Relation History in XML
Temporal Queries with XQuery
The ArchIS System
Performance Study
Database Compression
Conclusion
The ArchIS System
Two approaches are possible for storing and querying Hdocuments (H-views)
Native XML database approach: store H-documents directly
into XML DB
XML-enabled RDBMS. Design issues include:
mapping
(shredding) the XML views representing the Hdocuments into tables (H-tables)
translation
indexing,
of queries from the XML views to the H-tables
clustering and query mapping techniques
ArchIS: Archival Information System
The ArchIS System: Architecture
Current Database
Relational Data
SQL Queries
Active Rules/
update logs
Temporal XML Data
H-views
(H-documents)
H-tables
Temporal XML Queries
A
R
C
H
I
S
H-tables
Assumptions
Each
entity or relation has a unique key ( or
composite keys) to identify it which will not change
along the history. e.g., employee: empno
H-tables:
attribute
history table: store history of each attribute
key table: built for the key
global relation table: record the history of relations
e.g.: current database:
employee(empno,
title)
name, sex, DOB, deptno, salary,
H-tables
current
table
employee
(cont’d)
H-tables
relations(relationname, tstart, tend)
global relation
table
empno
key table
employee_id(id, tstart, tend)
name
attribute history
employee_name(id, name, tstart, tend)
…
table
…
salary
employee_salary(id, salary, tstart, tend)
title
employee_title(id, title, tstart, tend)
H-tables (cont’d)
Sample contents of employee_salary:
ID
=======
100022
100022
100022
100022
100022
100023
...
SALARY
=======
58805
61118
65103
64712
65245
43162
TSTART
==========
02/04/1985
02/05/1986
02/05/1987
02/05/1988
02/04/1989
07/13/1988
TEND
==========
02/04/1986
02/04/1987
02/04/1988
02/03/1989
02/03/1990
07/13/1989
Updating Table Histories
Changes in the current database can be
tracked with either update logs or triggers
DB2:
triggers
ArchIS:
update logs
Query Mapping
General purpose query mapping: XPeranto
In ArchIS, we have well-defined mapping
between H-documents (or H-views) and Htables
We map temporal XQuery queries into SQL,
utilizing SQL/XML
SQL/XML is
a new standard to map between RDBMS
and XML
Both
tag-binding and structure construction is pushed
inside the relational engine, thus be very efficient
SQL/XML Publishing Functions
XMLElement and XMLAttribute
select XMLElement (Name "dept",
XMLAttributes (tstart as "tstart", tend as "tend"),
deptname) from dept where deptname = ‘Sales’
<dept tstart = "02/04/1985" tend = "12/31/9999"> Sales </dept>
XMLAgg
select XMLElement (Name as "new_employees",
XMLAttributes ("02/04/2003" as "Since")
XMLAgg
(XMLElement (Name as "employee", e.name))
from employee_name as e
where e.tstart >= ‘02/04/2003’ <new_employees Since = "02/04/2003">
<employee>Bob</employee>
<employee>Jack</employee>
</new_employees>
XQuery Mapping to SQL with SQL/XML
Temporal projection: retrieve the salary history of
“Bob”:
element salary_history {
for $s in doc("employees.xml")/
employees/employee/[name=“Bob”]/salary
return $s }
select
XMLElement (Name "salaryhistory",
XMLAgg (XMLElement (Name as "salary",
XMLAttributes (S.tstart as tstart,
S.tend as "tend"), S.salary)))
from employee_salary as S, employee_name as N
where N.id = S.id and N.name = 'Bob'
group by N.id
XQuery Mapping to SQL with SQL/XML: Steps
Identification of variable range
Map
variables in FOR/LET clause into underlying Htables
Generation of join conditions
There
is a join condition any pair of distinct tuple
variables: join them by ids
Translation of built in functions
Map
built-in temporal functions in XQuery into
functions in ArchIS
Output generation
use
XMLElement and XMLAgg constructs
Temporal Clustering and Indexing
Tuples in H-tables are stored in the order of
updates, thus neither temporally clustered nor
clustered by objects
Traditional indexes such as B+ Tree will not help
on snapshot queries, and better temporal
clustering is needed
For every segment, usefulness: U = Nlive/Nall
At
the beginning, U =100%, and it decreases with
updates
The
minimum tolerable usefulness: Umin
Segment-based Clustering Scheme
Live
Live
All
All
All
Segment 1
Segment 2
Segment 3
segstart1
segend1
segstart2
segend2
tstarttuple <= segendSEG
tendtuple >= segstartSEG
segstart3
segend3
Segment-based Clustering Scheme
Initially all tuples for an attribute history table are
archived in a live segment SEGlive with usefulness
U =100%. With updates, when U drops below Umin:
1. A new segment is allocated;
2. The interval of this segment is recorded in the table
segment(segno, segstart, segend);
3. All tuples in SEGlive are copied into a new segment
Si sorted by id;
4. All live tuples in SEGlive are copied into a new live
segment SEGlive', and the old live segment is dropped;
After that, the new segment SEGlive’ becomes the new
starting segment for updates
Segment-based Clustering Scheme (cont’d)
Sample segments:
Segment1 (01/01/1985 - 10/17/1991):
ID
SALARY TSTART
TEND
100002 40000 02/20/1988
02/19/1989
100002 42010 02/20/1989
02/19/1990
100002 42525 02/20/1990
02/19/1991
100002 42727 02/20/1991
12/31/9999
...
Segment2 (10/18/1991 - 07/08/1995):
ID
SALARY TSTART
TEND
100002 42727 02/20/1991
02/19/1992
100002 45237 02/20/1992
02/18/1993
100002 46465 02/19/1993
02/18/1994
100002 47418 02/19/1994
02/18/1995
100002 47273 02/19/1995
12/31/9999
...
Advantages of
Segment-based Clustering Scheme
The current live segment always has a high
usefulness, assuring efficient updates;
Records are globally temporally clustered on
segments;
For snapshot queries, only one segment is used;
for interval queries, only segments involved are
used;
Flexibility to control the number of redundant
tuples in segments with Umin
Storage Usage of Segment-based Clustering
Relative storage size with different Umin
1.6
Storage Size
Nseg <= N0/(1-Umin)
1/(1-Umin)
Testing Data(Umin=0)
1.7
1.5
1.4
1.3
1.2
1.1
1.0
0.0
0.1
0.2
Umin
NS
0.3
0.4
Query Performance on Temporal Data with
Segment-based Clustering
ArchIS without segment-based clustering
ArchIS with segment-based clustering
Queries:
Point: Q1
Snapshot: Q2
Interval: Q5
History: Q3, Q4, Q6
5
Seconds
4
3
2
1
Q1
Q2
Q3
Q4
Q5
Q6
Outline
Motivation
Viewing Relation History in XML
Temporal Queries with XQuery
The ArchIS System
Performance Study
Database Compression
Conclusion
Performance Study: Experimental Setup
Systems: Tamino, DB2, and ArchIS
ArchIS
uses BerkeleyDB as its storage manager, and it
builds on top of it a SQL query engine
Temporal data set: the history of 300,024
employees over 17 years
The
simulation models real world salary increases,
changes of titles, and changes of departments
The size of the XML data is 334MB
The single large XML document is cut into a collection
of 15,000 small XML documents with around 25KB
each
Machine: Pentium IV 2.4GHz PC with RedHat 8.0
Performance Study: Query Performance
100
DB2
ArchIS
Tamino
DB2 and ArchIS:
with clustering
Tamino: without clustering
Seconds
10
1
0.1
Q1
Q2
Q3
Q4
Q5
Q6
snapshot query Q2 on ArchIS is 137 times faster than that on Tamino;
interval query Q5 is 91 times faster; history Q6 is 25 times faster; Q4 4 times faster,
and Q3 near 3 times faster.
Tamino with clustering: snapshot Q2 is 3.3 times faster than without clustering ( still 41
times slower than archIS); interval query Q5 is 2.9 times faster than without clustering
( still 31 times slower than on ArchIS); history queries are much slower
Storage Utilization
Compression Ratio
1.5
1.0
0.5
0.0
DB2
ArchIS Tamino (with compression)
Outline
Motivation
Viewing Relation History in XML
Temporal Queries with XQuery
The ArchIS System
Performance Study
Database Compression
Conclusion
Database Compression
The disparity between CPU/memory and disk
speeds is becoming larger and larger
Cost
to read one IDE disk page: 14ms
Cost
to uncompress one page: 1.1ms(500MHz CPU)
0.26ms(2.4GHz CPU)
Cost
to retrieve one compressed page: 14ms +
0.26ms = 14.3ms
Cost
to retrieve uncompressed pages (3.6 pages):
14ms x 3.6 = 50.4ms
Page-based Compression: PageZIP
Traditional data compression tools: compress a
file as a whole
PageZIP: page-based compression and
uncompression at the granularity of a page
Based on gzip library: zlib
Benefit: save space; point, snapshot or interval
queries only retrieve a small fraction of the
history, and can be efficient
PageZIP
page 1
page 2
page 3
…
Segment 1
…
Segment n
ID: 1001 - 1100
ID: 1100 - 1203
ID: 1203 - 1331
Storage Utilization with Compression
For each attribute history table, we compress it as
a sequence of pages and store each page as a
BLOB in a RDBMS
employee_salary (sid, salary, tstart, tend) =>
employee_salary_blob(pageno, startsid, endsid, pageblob)
without compression
1.5
Compression ratio
with compression
1.0
0.5
0.0
Tamino
DB2
ArchIS
Query Performance with Compression
DB2 without compression
DB2 with compression
Tamino
Seconds
100
ArchIS without compression
ArchIS with compression
10
1
Q1
Q2
Q3
Q4
Q5
Q6
Update Performance
For RDBMS, only the current segment is used for
updates. For Tamino, current data and historical data are
clustered together
Update an employee’s salary:
DB2: 0.29 seconds; Tamino: 1.2 seconds
Assume that every employee gets updated once a year:
about 1/260 of the total employee get updated every day
on average
DB2: 1.52 seconds; Tamino: 15 seconds
In the worse case for segment-based archiving: 39 seconds for
copying segments and 36 segments for compression: but only
once
Summary
We built a transaction time temporal database
on RDBMS and XML, with:
XML
to support temporally grouped (virtual)
representations of the database history
XQuery to express powerful temporal queries on such
views
temporal clustering for managing the actual historical
data in a RDBMS
SQL/XML for executing the queries on the XML views
as equivalent queries on the relational DB
compression as option for efficient storage
ArchIS provides a unified solution for a wide
spectrum of temporal application problems
Future Work
Friendly temporal query interfaces based on
temporally grouped models
Other clustering and indexing techniques to be
investigated
Other efficient data compression techniques
proposed for XML data to be investigated
Apply the approach to valid-time DB and bitemporal DB
Apply the approach to OODBMS and semistructured data model