Download Comparison of Access Methods for Time-Evolving Data

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Commitment ordering wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Microsoft SQL Server wikipedia , lookup

Open Database Connectivity wikipedia , lookup

Microsoft Access wikipedia , lookup

Functional Database Model wikipedia , lookup

Database wikipedia , lookup

Serializability wikipedia , lookup

Relational model wikipedia , lookup

ContactPoint wikipedia , lookup

Concurrency control wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Clusterpoint wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Versant Object Database wikipedia , lookup

Database model wikipedia , lookup

Transcript
Comparison of Access Methods for
Time-Evolving Data
Betty Salzberg and Vassilis Tsotras
CS599, Temporal and Spatial Databases Course.
Presented by:
Atousa Golpayegani
11/16/2000
1
Temporal database design problem
Temporal
Queries
Access Methods
R-tree
snapshot-index
DBMS Software
B+-tree
time-index
Time Dimension
Access Method?
Database Designer
2
Temporal database design problem
Temporal
Queries
Access Methods
B+-tree
R-tree
snapshot-index
Access Method
Selection Criteria
DBMS Software
Selected Time Dimension
time-index
Access Method
Database Designer
3
Outline
 Brief introduction to temporal databases
 Criteria for comparison of access methods
 Efficient method design for transaction/Bitemporal data
 Method classification and comparison
4
Conventional Vs. Temporal
Databases
 Conventional database captures only a single snapshot of the
modeled reality, usually the most current.
Can not support past and future data.
Ex: “ the current salary of an employee”.
 Temporal database supports some time domain and is thus able to
manage time varying data.
User-defined time is excluded.
Ex: “ the current and the previous salaries of an employee since
hiring”.
5
What is Stored in a Temporal
Database?
 Tuple-versioning Temporal model is used. In this model Database is
a set of records that store the versions of the real life objects. Each
record has:
A time invariant key
A number of time variant attributes (for simplicity just one)
One or two intervals
Start time
End time
 Ex:
Employee (ss#, name, salary, [t1, now) )
6
Taxonomy of Time in a Temporal
Database
 Transaction Time
 Is defined as the time when a fact is stored in the database.
 Valid Time
 Is defined as the time when a fact becomes effective (valid)
in reality.
 Bitemporal Time
Is the combination of the above two types.
 Time is assumed to be discrete, and consecutive nonnegative
integers
 Any change is assumed to occur only at an indicated time
Addition of an object
Deletion of an object
Value change of an object’s attribute
7
Transaction Time Database
 Time Interval shows when an object was added and deleted.
 An object is alive from the time it is added to the database until it
is deleted.
Ex. Object ‘a’ is the only one that is currently alive.
 No record is physically deleted (logical deletion).
Therefore database has logical state
Ex. Object ‘b’ is deleted.
 Logical state at time t consists of those records whose
transaction time interval contains t.
Logical state T4
 Ex. S(T4) = (‘a’ , ‘b’)
a
b
T1
b
T2
T3
T4
T5
Time
8
Assumptions in a Transaction Time
Database
 Past states can not be changed.
 Linear transaction time evolution
 A New database state is created by updating only the current
database state.
 Another option is called branching transaction time evolution,
where new states can be created from any past database state.
 Implicit updating assumption
 If an object is updated at time t, the database system will be
updated at the same time. There is no delay.
9
Transaction Time Database Example
(SS1,A,X1)
T1
key
Name
Salary
Transaction Time
SS1
A
X1
[ T1, now)
SS2
B
X2
[ T2, T3)
SS2
B
X2
[T4, T5)
(SS2,B,X2)
T2
(SS12B,X2)
T3
T4
T5
Time
10
Transaction Time Database Access
Method Characteristic
 Store the past logical states.
 Support addition/deletion/modification changes on the objects of the
current logical state.
 Efficiently access and query objects in any of logical states.
Logical state T4
a
b
T1
b
T2
T3
T4
T5
Time
11
Valid Time Database
 Time Interval shows the validity period of an object in reality.
Ex. The period that a contract is valid.
 An object that is not valid anymore will be physically deleted
from the database.
 Only the latest “snapshot” of the collection of interval-objects are
kept.
I2
I1
Valid Time
State i-1
Valid Time
State i
12
Valid Time Database Access Method
Characteristic
 Store the latest collection of interval-objects.
 Support addition/deletion/modification changes to this collection.
 Efficiently query the interval-objects contained in the collection.
I2
Valid Time
13
Bitemporal Database
 Supports both transaction time and valid time.
 Instead of a single collection of interval-objects, there is a
sequence of collections, C(ti), indexed by transaction time.
 Transaction time and valid time are two orthogonal time
dimensions.
 Ex. (Contract#, amount, duration, [t1, now))
t1
C(t1)
t2
C(t2)
t3
C(t3)
t4
C(t4)
14
Bitemporal Database Access Method
Characteristic
 Store its past logical states.
 Support addition/deletion/modification changes on the intervalobjects of its current logical state.
 Efficiently access and query the interval-objects on any of its states.
15
Criteria for Comparison of
Access Methods
 Query
 Access Method Costs
 Index Pagination and Data Clustering
 Migration of Past Data to Another Location
 Lower Bounds on I/O Complexity
16
Query
 One of the criteria for the comparison is how efficient an access
method answers a query.
 Three general classes of queries are chosen.
 They can be used for both valid time and transaction time since
from a query perspective, valid time and transaction time are
simply collections of intervals.
17
Query Type I
Given a contiguous interval T, find all objects alive during this
interval.
 Representative query for this type:
Pure-timeslice query : A special case of type I when interval
T is reduced to a single time instant t
Ex.” Find all employees working at the company at time T’ “
T’
T
a
b
T1
b
T2
T3
T8
T9
Time
18
Query Type II
Given a key range and contiguous time interval T, find the objects
with keys in the given range that are alive during interval T.
 Representative query for this type:
Range-timeslice query : A special case of type II when interval
T is reduced to a single time instant t
Ex. “ Find the employees working at the company at time T’
and whose ssn belongs in range K “
Range = {a,b}
a
T
c
b
T1
T’
b
T2
T3
T8
T9
Time
19
Query Type III
Given a key range, find a history of objects in this range.
 Representative query for this type:
Pure-Key query : A special case of type III where the key
range is reduced to a single key
Ex. “Find the salary history of employee with ssn b“
Range = {b}
a
c
b
T1
b
T2
T3
T8
T9
Time
20
Bitemporal Queries
 Bitemporal queries will be a combination of the previous types.
 Ex.“ Find all the company contracts that were valid on v
= Jan,1,94 as recorded in the database during
transaction time interval
T = May,1 - May,20, 1993.
21
Three-Entry Notation Query-Type
Representation
Key / Valid / Transaction
 Values for Each entry:
Point
Range
‘ * ’
‘ - ‘
 Example
Transaction time Pure-Timeslice: “ * / - / point “
Valid Pure-Key: “ point / * / - “
22
Access Method Costs
 The performance of an access method is characterized by three
costs:
1) Storage Space, to physically store the data records and
the structures of the access method
2) Update Processing Time, is the time to update the
method’s data structures as a result of a change.
3) Query Time, for each of the basic queries.
23
Storage Space
 For Transaction time and Bitemporal methods, storage space is
a function of ‘n’, O(n). ‘n’ is the number of changes in the
database.
Ex. For 1000 insertions and 1000 deletions, n is 2000.
 For Valid time methods, storage space is a function of ‘l ’, O(l).
‘l ’ is the number of interval-objects currently stored in the
method..
Ex. If seven Objects are stored in the database, l = 7.
24
Update Processing Time
 Update processing time is dependent on the access method
index structure.
For tree structured access method, it is a logarithmic time.
Ex. The best in-core algorithm for valid pure-timeslice query,
requires O (log l ) update processing time per change.
25
Query Time
 A method’s query time is a function of the query answer size,
denoted by ‘ a ‘.
 It is usually consists of two parts:
 Logarithmic time for searching the access method structure to
find the answer records.
Time that is required to retrieve the result.
Ex. The best in-core algorithm for valid pure-timeslice query,
requires O (log l + a) query time.
26
Index Pagination and Data clustering
 In a database environment the major cost of computation is based on
how many pages are transferred between main memory and
secondary memory.
 Index pagination. How well index nodes of a method are paginated
on the secondary memory.
 Since index is used to search for and update data, its pagination
greatly improves the performance of the access method.
 Ex. B+-tree is a well paginated index, since it needs O(logB r)
page accesses for searching or updating r objects, using pages
of size B.
27
Index Pagination and Data clustering
 Data Clustering. Is to store ‘logically’ related data, physically close.
 It can substantially improve the performance of an access method
since in order to answer query, fewer pages are accessed.
Ex. A transaction pure-timeslice query takes O(logB n + a/B) page
accesses if the result are clustered in pages of size B. ‘n’ is the
number of changes in the database and ‘a’ is the answer size. If
result is not clustered O(logB n + a) page accesses are required.
28
Migration of Past Data to Another
Location
 Methods that support transaction time maintain all their past states.
This property can easily result in excessive amounts of data.
 Methods that support the following two criteria will have a better
performance:
 Whether or not past data can be separated from the current data, so
that the smaller collection of the current data can be accessed more
efficiently.
 Whether data is appended sequentially to the method and never
changed, so that write-once read-many (WORM) devices can be
used.
29
Migration of Past Data to Another
Location
 Since current data is more frequently accessed than past data, If a
method supports these criteria, past data can be moved to a higher
capacity, but slower medium such as optical disk.
 There are two ways for this separation:
 With Manual approach, a process vacuums all records that are
‘dead’ when the process is called and moves them to the optical
disk.
 With Automated approach, any objects that become ‘dead’
(logically deleted) will be moved to the optical disk
30
Lower Bound on I/O Complexity
 The minimum number of I/Os to solve a transaction time query is at
least (logB n + a/B)
 Since in a paginated environment O(logB n) is the minimum number
of I/Os to search for the answer in the index structure and O(a/B) is
the minimum number of pages to store the answer.
 This lower bound is under the assumption that all time instants have
the same probability of being in the query predicate.
 If queries have special probabilities, the logarithmic search part will
be improved. Ex. If most queries ask for the recent times.
31
Lower Bound on I/O Complexity
 The minimum number of I/Os to solve a valid time query is also at
least  (logB l + a/B).
 A method is called “ I/O Optimal “ if it achieves:
 O(logB n + a/B) query time and O(n/B) space for transaction time
 O(logB l + a/B) query time and O(l/B) space for valid time.
 O(logB m) update processing per change. ‘m’ is the number of alive
objects when the update takes place.
32
Efficient Method Design for
Transaction/Bitemporal Data
 The Question is how to efficiently store large amount of data
produced by a transaction time database.
 There are two approaches for storing the data to answer puretimeslice queries efficiently (copy, log):
 Copy approach stores a copy of the transaction database state
(timeslice) for each transaction time that at least one change
occurred.
Copy
Copy
Copy
Copy
Copy
a
b
T1
b
T2
T3
T4
T5
Time
33
Efficient Method Design for
Transaction/Bitemporal Data
 These copies are indexed by time t. Since changes arrive in order,
therefore indexes are paginated (which provides a minimal query
time).
 It means O(logB n) page accesses are needed to find the timeslice
and O(a/B) pages to retrieve the result. Therefore the page
accesses and also query time is O(logB n + a/B)
Copy 1 Rec.
Copy 2 Rec.
Copy 2 Rec.
Copy 2 Rec.
Copy 2 Rec.
a
b
T1
b
T2
T3
T4
T5
Time
34
Efficient Method Design for
Transaction/Bitemporal Data
 Disadvantages:
 The space used can in the worst case be O(n2/ B) pages.
1 + 2 + 3 + 4 + 5 + . . . + n = n *(n+1)/2 = (n2 + n)/2
O(n2/ B ).
 Update processing per change in the worst: O(n/ B ).
Copy 1 Rec.
a
Copy 2 Rec.
b
T1
Copy 3 Rec.
c
T2
T3
Copy 4 Rec.
d
T4
Copy 5 Rec. . . .
e
T5
Time
35
Efficient Method Design for
Transaction/Bitemporal Data
 Log approach stores only the changes that occur in the database
time-stamped by the time instant on which they occurred.
The space used is reduced to O(n/ B) pages.
The Update processing per change is also reduces to O(1).
The Query time increased to O(n/ B) Since to find a timeslice all
the past records may have to be searched
Copy 1 Rec.
Copy 1 Rec.
Copy 1 Rec.
Copy 1 Rec.
Copy 1 Rec.
a
b
T1
b
T2
T3
T4
T5
Time
36
Efficient Method Design for
Transaction/Bitemporal Data
 For the Pure-Key queries a better approach than copy and log is to
store history of each key separately, This creates a key-Only
method.
 For the Range-Timeslice queries it is best to cluster by transaction
time and key within pages.
 For the Bitemporal queries the copy approach can be used to store
the collections at given transaction time and a log approach to store
changes between copies.
Another approach is to index the bitemporal objects on a single
time axis. This approach is not efficient because very few of the
accessed objects may satisfy the valid-time predicate.
37
Method Classification
 Transaction access method is classified into three categories:
Time-only : clustering data by time only.
Key-only : clustering data by key only.
Time-key : clustering data by both time and key.
38
Examples of Method Classes
Transaction Time
time-only
key-only
time-key
append-only tree
time-index
differential file approach
achievable time index
snap shot index
window method
reverse chaining
accession lists
time sequence array
C-lists
composite indexes
segment-R tree
write-once B-tree
time-split B-tree
multiversion access
structure
overlapping B-tree
39
Time Index Access Method
(Time Only)
40
Time Index Access Method
(Time Only)
 Based B+-Tree access method on the time axis.
 New nodes are always added on the rightmost leaf of the index.
 It is equivalent to Copy approach
PureTimeSlice
Query Time
Update
processing
time
Space
O(logB n + a/B)
O(n/B)
O(n2/B)
O(n/B)
O(n2/B)
N/A
N/A
RangeTImeSclice
O(M logB n + a)
Pure-Key
N/A
41
Reverse chaining (Key-Only)
Key
Key
Current
Past
B+-tree
Index
B+-tree
Index
(K3,A3,I3)
(K2,A2,I2)
(K1,A1,I1)
P3
P2
ø
(K4,A4”,I4”)
(K3,A3’,I3’) ø
(K4,A4’,I4’)
(K4,A4,I4)
ø
(K2,A2”,I2”)
(K2,A2’,I2’) ø
42
Reverse Chaining
 The current and past data are stored separately.
 Previous version of a given key are linked together in reverse
chronological order.
 Current data is assume to be queried more often.(smaller data size
means faster query)
 Each tuple includes
Key
Attribute value
Life span interval
Pointer (which points to the previous version, if any, of this key).
 Update processing is O(logB n) , the query time is O(logB n + a), and
the used space is O(n / B)
43
Comparison of Access Methods for
time evolving data
44
The End
45