A Survey of Persistent Data Structures
GEORGE LAGOGIANNIS1, YANNIS PANAGIS1,2, SPYROS SIOUTAS1,2 AND ATHANASIOS TSAKALIDIS1,2
1 Computer Engineering and Informatics Dept., University of Patras, Patras, 26504, GREECE
2 Internet Technologies and Multimedia Unit, RA Computer Technology Institute, 61 Riga Feraiou str., 26110, Patra, GREECE
Abstract: - Several years after the introduction of the notion of persistence by Driscoll et al. [15] we
review, in this paper, the most important data structuring paradigms and applications both in main and
secondary memory.
Key-Words: - Data structures, Persistence, Temporal Databases, Fat node, Node Copying, Node Splitting,
Applications.
1 Introduction
In commonly used data structures, update operations
such as insertions and deletions are catastrophic in
the sense that they discard the data structure they are
applied to; these data structures are called
ephemeral. On the other hand, several application
fields require keeping track of all versions a data
structure has undergone. Those areas include
databases (especially temporal ones), computational
geometry, text editing and functional programming
languages.
These needs were acknowledged in the seminal
paper of Driscoll et al. [15], in which the notion of
persistent data structures was coined. We concisely
present the notion of persistence here. Consider a
data structure D. As an update is applied to D, the
update leads D from a version vi to its next version
vi+1. If persistence is supported, then as D undergoes m update operations, all versions v0, v1, …, vm-1 are maintained. We identify two flavours of persistence, namely partial persistence and full persistence.
Partially persistent data structures allow storing and accessing all versions, but modifications can be applied only to the most recent version. By contrast, fully persistent data structures support access and modification of any version: if a sequence of m updates has already been applied to D, a further update to any version vi generates version vm. There is a third kind of persistence,
defined in the paper of Driscoll, Sleator and Tarjan
[16], namely confluent persistence. Confluently
persistent data structures support an operation which
combines two versions of the data structure to yield
a new version. Confluently persistent data structures
appear when converting meldable data structures,
e.g. priority queues, union-find and deques, to
persistent.
Notably, persistent data structures are also encountered under the name purely functional data structures. The term “purely functional” indicates data structures built using only operations that correspond to the LISP commands car, cdr and cons. These commands create nodes which are immutable and hence fully persistent. Note, however, that not every fully persistent data structure is purely functional.
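This correspondence can be sketched as follows (a minimal illustration; the names follow the LISP commands mentioned above, and Python tuples stand in for immutable cons cells):

```python
def cons(head, tail):
    """Create an immutable cons cell (a Python tuple)."""
    return (head, tail)

def car(cell):
    """Return the first element of a cons cell."""
    return cell[0]

def cdr(cell):
    """Return the remainder of the list."""
    return cell[1]

# Because cells are never mutated, every older version survives intact:
v0 = cons(1, cons(2, None))   # the list [1, 2]
v1 = cons(0, v0)              # the list [0, 1, 2]; v0 is still fully usable
```

Since no cell is ever overwritten, any structure built exclusively from these three operations is automatically fully persistent.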
In the rest of the paper, Section 2 reviews general techniques for converting data structures to persistent ones. Section 3 presents more efficient methods for supporting persistence in special cases of data structures. Section 4 covers persistence in secondary memory, and Section 5 concludes with application areas.
2 General Techniques for Persistent Data Structures
This section presents general techniques to support
persistence in data structures. Generality, in this
context, implies that the presented techniques are
not instantiated to a specific data structure. General
techniques assume the model of a linked data
structure of bounded in-degree.
Definition 1. A data structure D is called a linked data structure of bounded in-degree iff:
1. It comprises a collection of nodes.
2. Each node contains a collection of keys drawn from an ordered set, and pointer fields to other nodes.
3. Each node has bounded fan-in.
4. One or more entry points, i.e. points where navigation can start, are defined for D.
2.1 Partial and full persistence
We first of all consider the pointer machine model
of computation.
In persistent data structures we further assume that a node maintains, with each key k, a list of versions that correspond to the versions of the update operations on k. Driscoll et al. [15] presented two methods to obtain partial persistence: fat node and node copying.
The fat node method is perhaps the most obvious approach. A fat node is allocated for each node ever present in the data structure. This node is assumed to be arbitrarily large and capable of storing its entire modification history. Whenever an update to the ephemeral structure is performed, a new version is added to the corresponding fat node.
Besides the unrealistic space assumption, another disadvantage of this approach is the navigation overhead it incurs: if we maintain a fat node over m update operations, each access or update step is charged an extra O(log m) time (the time required to navigate over O(m) versions), whereas the extra space per update is O(1).
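The idea can be sketched as follows (a hedged illustration, not the exact scheme of [15]: a single fat node keeps its version-stamped history and answers reads by binary search, exhibiting the O(log m) navigation overhead):

```python
import bisect

class FatNode:
    """One fat node: the full, version-stamped history of a single field."""
    def __init__(self):
        self.versions = []   # sorted version stamps
        self.values = []     # values[i] is the value written at versions[i]

    def update(self, version, value):
        # O(1) extra space per update: just append the new entry.
        self.versions.append(version)
        self.values.append(value)

    def read(self, version):
        # Follow the entry with the largest stamp <= version: O(log m) time.
        i = bisect.bisect_right(self.versions, version) - 1
        if i < 0:
            raise KeyError("no value exists at this version")
        return self.values[i]

node = FatNode()
node.update(0, 'a')
node.update(3, 'b')
# node.read(2) returns 'a': version 2 still sees the value written at version 0
```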
In the node copying method, fixed-size nodes are used. When an update occurs and the persistent node is full, a new node is created to host the most recent data entries; the latter node is linked to the former by means of a copy pointer. Node copying also requires the direct ancestors of the updated node to have their pointer fields updated. Direct ancestors are located by maintaining ancestor pointers. Updates at direct ancestors may potentially cascade. Driscoll et al. [15] proved, however, that only a constant number of nodes is updated in the amortized sense. Thus, both time and space overheads are amortized O(1).
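A much simplified sketch of the mechanism (illustrative only: a singly linked list whose nodes carry one spare version slot; the real scheme of [15] handles arbitrary fields and relinks ancestors via the ancestor pointers described above):

```python
class Node:
    """Fixed-size persistent node: one original 'next' plus one spare slot."""
    def __init__(self, val, nxt=None):
        self.val = val
        self.next0 = nxt       # 'next' pointer at creation time
        self.extra = None      # at most one (version, next) amendment

    def next_at(self, version):
        # Use the amendment if it exists and is old enough, else the original.
        if self.extra is not None and self.extra[0] <= version:
            return self.extra[1]
        return self.next0

def set_next(node, version, new_next):
    """Record a new 'next' pointer; returns the node now holding live data.
    If the spare slot is free the node absorbs the change; otherwise the
    node is copied (and, in the full scheme, ancestors would be relinked)."""
    if node.extra is None:
        node.extra = (version, new_next)
        return node
    return Node(node.val, new_next)   # copy hosting only the newest entry

a = Node('a', Node('b'))
live = set_next(a, 1, Node('c'))      # spare slot absorbs the update
live = set_next(live, 2, Node('d'))   # slot full: node 'a' is copied
# a.next_at(0).val is 'b', a.next_at(1).val is 'c', live.next_at(2).val is 'd'
```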
The fully persistent case is considerably more involved. The first issue is navigation: at a persistent node one searches for a data value and a version, and the rule is, given the data value, to follow the version immediately smaller than the sought one. This scheme works fine for partial persistence; in full persistence, however, versions evolve not linearly but in a tree fashion, and therefore version ordering is not directly possible. Nevertheless, the version tree can be linearized and subsequently maintained by an order maintenance data structure [14, 32]; see [15] for more details.
The implementation schemes of [15] for full persistence are two: fat node and node splitting. The fat node method is essentially the same as in the partially persistent case.
Node splitting likewise simulates fat nodes for full persistence with nodes of fixed size. The intuition is to maintain, for each ephemeral node that ever appeared, a set of nodes, each of which maintains several ancestor pointers. Recall that updates can occur to any version of the node's data fields. In the general case, at each node v participating in an update of a field f at version i, two extra nodes may be created, which in turn leads to updates on the ancestors of v. Furthermore, an update at any version may cause v to overflow. Copying is not applicable, since any data field version may still be valid; consequently, v must be split into two parts. Node splittings may thus cascade throughout the data structure. Nevertheless, it was shown in [15] that only O(1) nodes are affected in the amortized sense. Therefore, the time and space overhead per access and update step is only O(1) amortized, as long as each ephemeral node has bounded in-degree.
An open issue for some years was how to
eliminate the amortization from the bounds of
Driscoll et al. An answer to this question for the
RAM model of computation was given by Dietz and
Raman [13]. The authors designed a pebble game in
graphs and thereby managed to obtain the bounds of
[15] in the worst case. Brodal [3] presented a
combinatorial game to eliminate amortization for
partially persistent data structures of bounded in-degree in the pointer machine.
2.2 Confluent persistence for general data structures
The methods of Driscoll et al. [15] unfortunately do not carry over to the case where confluent persistence is required. The reason is that navigation between versions becomes impossible, since confluent persistence defines a graph of versions. Other reasons also explain why fully persistent methods fail; one of them is that if a deque with two elements is catenated with itself n times, the resulting persistent structure will occupy 2^n nodes.
Fiat and Kaplan [18] present efficient methods to transform general linked data structures into confluently persistent ones. The core of these methods is to succinctly represent nodes that have a common “version ancestor”. They showed that if the total number of assignments is U, then the update creating version v costs O(d(v) + log U) time, where d(v) is the depth of v in the version graph, and the space requirement is O(d(v) log U) words. Fiat and Kaplan have also presented more efficient methods; for details see [18].
3 Persistence for special kinds of data structures
The methods of the previous section apply to general data structures. However, more efficient persistent data structures are available for many specific instances.
3.1 Arrays
Dietz [12] studied ways to make arrays and array-based data structures fully persistent. His method was to represent each array as a huge fat node and to handle updates with a combination of van Emde Boas trees, perfect hashing and bucketing. Thus, an access costs O(log log m) and an update O(log log m) expected amortized time, where m, as before, is the total number of updates.
3.2 Search Trees
Driscoll et al. [15] apply their full-persistence techniques to red-black trees [19]; this time, by modifying the delayed update technique of Tsakalidis [33], the bounds are turned into worst case.
3.3 Stacks, queues and deques
Stacks, queues and deques are the simplest data structures that can be converted to fully persistent ones. Results for these data structures relate persistence to functional programming.
The most general of the three is the double-ended queue, or deque for short. It is a doubly linked list where update operations are allowed only at the two ends. When only a subset of these operations is allowed, we obtain stacks and queues.
We do not treat stacks and queues separately, since their persistent implementations result from persistent deques. For a complete account of the history of purely functional lists, refer to [23] or to the book of Okasaki [28].
The first fully persistent deque implementation was presented by Chuang and Goldberg [9]. They decompose the deque into two stacks, one for each end of the deque. Stacks can easily be implemented with standard functional commands so as to be purely functional. Furthermore, they use incremental global rebuilding to ensure that no stack holds fewer than a certain number of elements. Thus, they manage to support every persistent operation with O(1) worst-case time and space overhead.
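The two-stack decomposition can be sketched as follows (an illustrative simplification: the rebalancing here naively reverses a stack, whereas [9] uses incremental global rebuilding to stay O(1) worst case):

```python
class Deque:
    """Persistent deque as two immutable stacks (cons lists (item, rest))."""
    def __init__(self, front=None, back=None):
        self.front = front   # front stack: head is the frontmost element
        self.back = back     # back stack: head is the rearmost element

    def push_front(self, x):
        return Deque((x, self.front), self.back)

    def push_back(self, x):
        return Deque(self.front, (x, self.back))

    def pop_front(self):
        """Return (element, new deque); assumes the deque is nonempty."""
        f, b = self.front, self.back
        if f is None:
            # Naive rebalance: pour the back stack over, reversing it.
            while b is not None:
                f = (b[0], f)
                b = b[1]
        return f[0], Deque(f[1], b)

d1 = Deque().push_back(1).push_back(2)
d2 = d1.push_front(0)
x, _ = d2.pop_front()   # x is 0
y, _ = d1.pop_front()   # y is 1: the older version d1 is still intact
```

No operation mutates an existing deque; every update returns a fresh value, so all versions remain accessible, which is exactly the persistence property discussed above.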
When the operation repertoire is extended to include catenation of deques, the problem boils down to supporting confluently persistent operations. The first work in the field was by Driscoll, Sleator and Tarjan [16], who showed how to perform the k-th deque operation in O(log log k) time. An improvement was given by Buchsbaum and Tarjan [8]. They represented a deque as an unbalanced tree of logarithmic height and then decomposed this tree into paths; the paths were in turn recursively represented in the same way (structural decomposition). Thereby, they showed that deque catenation costs O(1) and updates at the ends cost O(log* n), where log* n = 1 + log*(log n).
Finally, Kaplan and Tarjan [23] gave an optimal solution to the deque problem. They represent each deque as a 6-tuple [(p1,d1,s1),(p2,d2,s2)], where each pi and si is a non-catenable deque and each di is a catenable deque recursively defined in the same way (only that, at the recursive levels, the elements are triples (p,d,s) of the level above). The entire representation in [23] is considerably complicated. The intuition, however, is to associate each recursive level with a digit of a redundant counter [10] and thus to control movements of elements the same way a redundant counter controls carry propagation. The deque of Kaplan and Tarjan supports each deque operation in O(1) worst-case time.
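The counter intuition can be sketched as follows (a hedged illustration of a redundant binary counter with digits in {0, 1, 2}: fixing the lowest 2 before each increment performs at most one carry, so carries never cascade, mirroring how [23] bounds the element movement per operation):

```python
def increment(digits):
    """Increment a redundant binary counter; digits[0] is least significant,
    each digit is 0, 1 or 2, and at most one carry is performed per call."""
    digits = list(digits)
    # Fix the lowest 2, if any: 2 * 2^i == 0 * 2^i + 1 * 2^(i+1).
    for i, d in enumerate(digits):
        if d == 2:
            digits[i] = 0
            if i + 1 == len(digits):
                digits.append(1)
            else:
                digits[i + 1] += 1
            break
    digits[0] += 1   # digits[0] <= 1 at this point, so it stays at most 2
    return digits

def value(digits):
    """The number represented: sum of digits[i] * 2^i."""
    return sum(d << i for i, d in enumerate(digits))

c = [0]
for _ in range(6):
    c = increment(c)
# value(c) is 6, and no digit has ever exceeded 2
```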
If one is allowed to resort to amortization (and memoization), Kaplan et al. [22] use ideas similar to those of [23] to derive a simpler deque where each operation is performed in O(1) amortized time.
3.4 Priority Queues
Brodal and Okasaki [4] have presented a priority
queue data structure which supports each operation
with O(1) time overhead.
4 Persistence in Secondary Memory
Applying persistence to secondary memory data structures is of independent interest, since persistence finds fertile ground in databases. A simple example is that of transaction databases: databases that store information with a certain lifetime. Consider, for example, a DB of patients in a hospital. Each record stores a unique patient_id and the period for which the patient was hospitalized. In order to retrieve a patient's history, all time periods for all patients must be stored. This is tantamount to maintaining a persistent dictionary, where the indexed information is the patient_id, the nursing period is an interval [vj, vk], and vj (vk) denotes the version at which the patient_id was inserted (deleted).
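As a hedged sketch of this view (the class and method names are illustrative, not taken from [30]): each patient_id maps to the version intervals during which it was present, and a history query simply checks interval membership at the requested version:

```python
class PatientHistory:
    """Persistent dictionary sketch: patient_id -> hospitalization intervals."""
    def __init__(self):
        self.intervals = {}   # patient_id -> list of [v_insert, v_delete]
        self.version = 0      # the most recent version

    def admit(self, pid):
        self.version += 1     # each update creates the next version
        self.intervals.setdefault(pid, []).append([self.version, None])

    def discharge(self, pid):
        self.version += 1
        self.intervals[pid][-1][1] = self.version   # close the open interval

    def hospitalized_at(self, pid, v):
        """Query any past version v, not just the most recent one."""
        return any(vj <= v and (vk is None or v < vk)
                   for vj, vk in self.intervals.get(pid, []))

h = PatientHistory()
h.admit('p1')       # version 1: p1 inserted
h.discharge('p1')   # version 2: p1 deleted
h.admit('p1')       # version 3: p1 inserted again
# h.hospitalized_at('p1', 1) is True; h.hospitalized_at('p1', 2) is False
```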
An extensive treatment of temporal and bi-temporal DBs and their relation to persistence can
be found in the survey of Salzberg and Tsotras [30].
4.1 Persistent B-trees
The work in this area was initiated by the paper
of Lanka and Mays [27]. They present an extension
of the fat-node technique to devise fully persistent
B+-trees.
A fat node corresponds to a node of an ephemeral B+-tree and is simulated by a sequence of fixed-size disk blocks. There are two kinds of disk blocks: data blocks and version blocks. Data blocks store values and version blocks store the relevant versions. An update to a value field causes a new data block to be created, copying all the old values together with the updated field's value. Naturally, a fat node may undergo several changes, and therefore its associated version blocks can form a version tree. The height of this tree can be O(logBm).
Recall that in the node-splitting method of Driscoll et al., overflows occurred in the number of version fields and were handled by splitting nodes; this procedure is often termed version splitting. It contrasts with the situation in the trees of [27], where a fat node is split when the number of keys after an update causes a block overflow. This case is called key splitting, and it is handled by creating a new fat node and adding a new version to the old fat node.
The space consumption of this method is huge
since it is O(m) blocks in the worst case. Access can
be performed in O(logBm (logBn+k/B)) time and
updates in O(logBm logBn) time, where n denotes the
number of distinct values at the time of the operation
and k the answer size.
In order to reduce the space requirements of their data structure, Lanka and Mays propose a method called fat field. In this method, the empty fields of a block inside a fat node store modifications of data fields as long as they do not create any overflows. If an overflow appears, the version causing it and the data fields valid at this time are copied to a new fat node; this is called a time split. If the number of keys causes a key overflow, then a key split also occurs. With this method, the space occupancy is reduced to O(m/B).
Optimal partially persistent B+-trees have also been developed: the Multi Version B-Tree (MVBT) of Becker et al. [5] and the Multi Version Access Structure (MVAS) of Varman and Verma [34]. The two methods share essentially the same ideas, but MVAS has a slightly better space consumption constant.
The key principle is not to simulate fat nodes with blocks, but to embed time information in the B-tree hierarchy. More specifically, the tree maintains records with data fields and timestamps. Separate thresholds are defined to control version overflows and underflows in blocks. When an update is performed on a block, the block might overflow; the active records are then copied to a new block, a time split takes place if necessary, and updates are propagated one level up the tree. Analogously, a deletion may cause the number of active records to underflow. In this case, the two structures try to reduce space consumption in different ways: MVBT retrieves the records of a sibling having enough versions and copies all of them into the underflowed node, whereas MVAS copies only an adequate fraction of them and therefore achieves better space utilization.
Overall, the space consumption is O(m/B) in both cases, and updates can be performed in O(logBn) amortized time, where n is the number of active values at the time. An important trait of both structures is that they can answer stabbing queries and range queries in O(logBn + a/B) time, where a is the answer size, i.e. their performance depends entirely on the number of active versions.
Arge et al. [2] have presented a practical implementation of a partially persistent B-tree. The ideas presented in [2] do not deviate significantly from MVBT or MVAS. The main difference is that they do not use indexing nodes; their structure is node-oriented. This choice was dictated by the specific problem they had to solve, planar point location, which allows only a partial ordering between the elements in the structure. The time and space bounds are the same as for the previous structures.
4.2 Bi-temporal (partially persistent) R-trees
Kumar et al. [25] have developed a partially persistent version of the well-known R-tree data structure [20], termed the Bi-temporal R-tree (BTRT). The motivation behind the BTRT was to efficiently solve the indexing problem for “bi-temporal” data, i.e. data that have both “valid time” and “transaction time” attributes. For more details see [30].
Since the ephemeral R-tree bears strong conceptual similarities to B-trees, the BTRT bears significant similarities to the partially persistent data structures of [34] and [5]. More specifically, the BTRT applies the same update policies as its B-tree counterparts: if an insertion renders a node full, a time split and perhaps a key split are applied. Deletions may also create version underflows; in contrast to ephemeral R-trees, the BTRT applies several heuristics to merge an underflowed page with its siblings. Another notable characteristic is that even though updates alter the spatial attributes of index nodes, they do not create new versions of internal nodes, which is not the case in MVBT and MVAS. Furthermore, deletions do not cause an enclosing rectangle at an index node to shrink, since the leaves in each subtree must still be enclosed after a logical deletion.
5 Applications
Persistent data structures literally have a myriad
of applications. We list the most important of them
without going into much detail and redirect the
interested reader to the relevant publications.
First of all, persistent data structures can facilitate list and set manipulation in functional languages, where data is immutable. Confluently persistent lists have applications to continuation passing in functional programming languages [23].
Driscoll et al. [15] mention, among the applications of persistent data structures, the possibility of using them for incremental text editing and program editing [29]. Liu [26] has also used persistent data structures for debugging purposes in a very high-level programming language. A further application field for persistent data structures is CAD [11].
Persistent data structures have found numerous applications in Computational Geometry. The most classical example of such an application is the solution to the planar point location problem by Sarnak and Tarjan [31]. The external memory version of this problem was studied by Arge et al. [2]. Another body of applications uses persistence to reduce the space complexity of algorithms which incrementally process a collection of geometric objects in some standard direction. Examples of those algorithms include hidden line elimination [24], dominance pair reporting [7] and quadrant range searching on a grid [JUCS]. Bouroujerdi and Moret [6] give some further applications of persistence, including solutions to classical geometric problems. A few more geometric applications can be found in the excellent survey of Kaplan [21].
Finally, a large share of applications lies in the area of temporal databases. Salzberg and Tsotras [30] present persistent secondary memory index methods as the most provably efficient methods to solve “transaction time” and “bi-temporal” problems. Kumar et al. [25] provide quite a few applications for bi-temporal indexing methods, such as accounting [35], medical [17] and financial [1] ones.
6 Conclusions
Persistent data structures have proved quite influential throughout the years, providing a fundamental building block for applications in computational geometry, functional programming, temporal databases, etc. In this paper we have presented an overview of persistent data structures, covering both main memory and secondary memory solutions. We have also presented some of their applications, in the hope of seeing more of them in the future.
Acknowledgements
The authors would like to thank the European
Social Fund (ESF), Operational Program for
Educational and Vocational Training II (EPEAEK
II) and particularly the Program PYTHAGORAS,
for funding the above work.
References:
[1] Ariav, G. Information Systems for Managerial
Planning and Control: a Conceptual Examination
of their Temporal Structure. Journal of MIS,
Vol.9, No.2, 1992, pp.77-98.
[2] Arge, L., Danner, A. and Teh, S.-M. I/O-Efficient Point Location Using Persistent B-Trees, ACM J. of Experimental Algorithmics, Vol. 8, 2003.
[3] Brodal, G.S., Partially Persistent Data Structures
of Bounded Degree with Constant Update Time,
Nordic Journal of Computing, Vol. 3, No. 3,
1996, pp. 238-255.
[4] Brodal, G.S., and Okasaki, C. Optimal Purely
Functional Priority Queues. J. Funct. Program.
Vol.6, No. 6, 1996, pp.839-857.
[5] Becker, B., Gschwind, S., Ohler, T., Seeger, B.,
and Widmayer, P., An Asymptotically Optimal
Multiversion B-tree, The VLDB Journal, Vol. 5,
No. 4, 1996, pp. 264-275.
[6] Bouroujerdi, A., and Moret, B., Persistence in
Computational Geometry, In Proc. 7th Canadian
Conference in Computational Geometry,
(CCCG95), 1995, pp. 241-246.
[7] Bozanis, P., Kitsios, N., Makris, C., and
Tsakalidis, A.K. The Space-Optimal Version of
a known Rectangle Reporting Algorithm.
Information Processing Letters, Vol. 61, 1997,
pp. 37-41.
[8] Buchsbaum, A.L., and Tarjan, R.E. Confluently Persistent Deques via Data Structural Bootstrapping. J. of Algorithms, Vol. 18, 1995, pp. 513-547.
[9] Chuang, T.R., and Goldberg, B. Real-time Deques, Multihead Turing Machines, and Purely Functional Programming, in Proc. of the Conference on Functional Programming Languages and Computer Architecture, 1993, pp. 289-298.
[10] M. J. Clancy, D.E. Knuth, A programming and
problem-solving seminar, Technical Report
STAN-CS-77-606, Department of Computer
Science, Stanford University, Palo Alto, 1977
[11] Chou, H., and Kim, W. A Unifying Framework
for Version Control in a CAD environment. In
Proc. VLDB, 1986
[12] Dietz, P.F. Fully Persistent Arrays, in Proc. WADS '89, 1989, pp. 67-74.
[13] Dietz, P., and Raman, R. Persistence, Amortization and Randomization, in Proc. of the 2nd ACM-SIAM Symposium on Discrete Algorithms (SODA '91), 1991, pp. 78-88.
[14] Dietz, P. and Sleator, D.D. Two algorithms for
maintaining order in a list, in Proceedings 19th
Annual ACM Symp. on Theory of Computing,
1987, pp. 365-372.
[15] Driscoll, J.R., Sarnak, N., Sleator, D., and
Tarjan, R.E. Making Data Structures Persistent.
J. of Comp. and Syst. Sci. Vol 38, No. 1, 1989,
pp. 86-124.
[16] Driscoll, J.R., Sleator, D., and Tarjan, R.E.
Fully persistent lists with catenation. J. of the
ACM, Vol. 41, 1994, No. 5, pp. 943-959.
[17] Fries, J.F. Time-oriented Patient Records and a
Computer Databank. Journal of Am .Med.
Assoc., Vol. 222, No. 12, 1972.
[18] Fiat, A., and Kaplan, H. Making data structures
confluently persistent, Journal of Algorithms,
Vol. 48, No. 1, 2003, pp. 16-58.
[19] Guibas, L. J. and Sedgewick, R., A dichromatic
framework for balanced trees, in proc., 19th
Annual IEEE Symp. on Foundations of
Computer Science, 1978, pp. 8-21.
[20] Guttman, A. R-Trees: A Dynamic Index
Structure for Spatial Searching. in proc. ACM
SIGMOD Conf., 1984, pp. 47-57.
[21] Kaplan, H., Persistent Data Structures, in
Handbook of Data Structures, CRC Press,
Mehta, D., and Sahni, S., (eds.), 2004.
[22] Kaplan, H., Okasaki, C., and Tarjan, R. E. Simple Confluently Persistent Catenable Lists. SIAM J. Comput. Vol. 30, No. 3, 2000, pp. 965-977.
[23] Kaplan, H., and Tarjan, R.E. Purely functional,
real-time deques with catenation, J. of the ACM,
Vol. 46, No. 5, 1999, pp. 577-603.
[24] Kitsios, N., and Tsakalidis, A.K., Space Optimal Hidden-Line Elimination for Rectangles, Information Processing Letters, Vol. 60, 1996, pp. 195-200.
[25] Kumar, A., Tsotras, V.J., and Faloutsos, C. Designing Access Methods for Bitemporal Databases, IEEE Trans. Knowl. Data Eng., Vol. 10, No. 1, 1998.
[26] Liu, Z., A persistent runtime system using
persistent data structures, in proc. of the ACM
Symposium on Applied Computing, (SAC), 1996,
pp. 429-436.
[27] Lanka, S., and Mays, E. Fully Persistent B+-trees, in Proc. of the ACM SIGMOD Conference on Management of Data, 1991, pp. 426-435.
[28] Okasaki, C. Purely Functional Data Structures,
Cambridge University Press, 1998.
[29] Reps, T., Teitelbaum, T., and Demers, A., Incremental Context-dependent Analysis for Language-based Editors, ACM Trans. Program. Lang. Syst., Vol. 5, 1983, pp. 449-477.
[30] Salzberg, B., and Tsotras, V. Comparison of
Access Methods for Time-Evolving Data. ACM
Computing Surveys, Vol. 31, No. 2, 1999, pp.
158-221.
[31] Sarnak, N., and Tarjan, R.E., Planar point
location using persistent search trees, Comm.
ACM, Vol. 29, 1986, pp. 669-679.
[32] Tsakalidis, A.K., Maintaining order in a
generalized linked list, Acta Inform., Vol. 21,
1984, pp. 101-112.
[33] Tsakalidis, A.K., AVL-trees for localized
search, Inform. and Control Vol. 67, 1985, pp.
173-194.
[34] Varman, P., and Verma, R. An Efficient Multiversion Access Structure, IEEE Trans. on Knowledge and Data Eng., Vol. 9, No. 3, 1997, pp. 391-409.
[35] Weber, R. Audit Trail System Support in Advanced Computer-based Accounting Systems. The Accounting Review, Vol. LVII, No. 2, 1982, pp. 311-325.