Object Oriented Database
(Group Write-up)
Course: CO42009
Matric Number: 96061707
Term Time E-mail: 96061707@napier.ac.uk
Date: 28/04/02
Course: BEng Hons Computing
Authors: Barry Myles, Stephanie Dunsire, Gary Stewart
Presentation Title: Persistence in OODBs
Napier University
Introduction to Persistence
“A persistent programming language (PPL) is a programming language that includes
a persistent memory area (e.g., a heap of objects) that outlives the execution of any
individual program.”
[Ref: Working with Persistent Objects – J. Eliot B. Moss]
To understand what persistence means we must consider the term in two ways.
Persistence, when related to an information system (such as a database or application),
is the ability of the program to store its current state on non-volatile storage. This
allows the state to survive beyond the last point at which the program was run. For
instance, any data held in RAM will not persist once the power is cut, as this type of
memory is volatile; we therefore describe data in this area as transient data. However,
any piece of data held on secondary storage (for instance a hard drive) can be expected
to still be present and retrievable when the system is turned off and back on.
The second way in which the term persistence is used is to describe the length of time
for which a single piece of data exists.
Consider a typical application written in a block-structured programming language, for
instance C++ or COBOL. We can confidently say that such a program will contain several
types of looping structure. Consider the “for” loop shown in Fig 1, paying special
attention to the variable declared as “k”.
When the loop is constructed “k” is initialized. Every time the loop iterates “k” is
incremented by one. Once the break condition is met and the loop terminates, should we
still retain the value of “k”? Sensibly the answer is no. To save the system from
performance degradation caused by insignificant data occupying space, we should flag
the memory location for garbage collection.
From creation to deletion this type of processing can be over in a matter of
nanoseconds, meaning that the datum in question has persisted for a very short period
of time. Conversely, consider the regulations recently imposed on ISPs (Internet
Service Providers), whereby they are now duty-bound to log and retain data pertaining
to the movements of their customers on the WWW (World Wide Web). This is an example of
data persisting for a considerably longer period of time.
FIG 1
for (int k = 0; k < theActions.length; k++) {
    // A null entry is shown as a separator; anything else is added as an action.
    if (theActions[k] == null) {
        panelPopup.addSeparator();
    } else {
        panelPopup.add(theActions[k]);
    }
}
When coupling the programming power of an object-oriented language like Java or
C++ with the data storage and manipulation mechanisms of a relational database,
difficulties arise. These two components have their own very distinctive
characteristics when it comes to the storage and orchestration of data. Programming
languages in general are geared towards the manipulation of transient data. Relational
databases, on the other hand, tend toward the manipulation of persistent data.
“This requirement is evident from a database point of view, but a novelty from a
programming language point of view,” [Ref: Atkinson 83 - An approach to
persistent programming].
It is clear, then, that to build a package with all the power of these two fundamental
elements we would be unwise to neglect the concept of either transient or persistent
data; we must find a good mix of the two. Another issue which arises in the
collaboration is the difference in modelling paradigms. How do we map objects in an OO
model to the tables in a relational database? Both of these problems fall under the
term known as impedance mismatch. There are ways round this, although they are outside
the scope of this presentation.
So far we’ve talked about what persistence is and some of the problems with it; now
let’s move on to describe the mechanics of how a program writes its persistent data to
secondary storage. There are three types of persistence, each with its own methods for
dealing with the data presented.
Kinds of Persistence
The three types of persistence are session persistence, file persistence and orthogonal
persistence.
Session Persistence
This method saves the current state of execution by taking all the data being used and
dumping the lot into secondary storage.
Take for example the Windows 98 operating environment, where a type of session
persistence is used in the facility known as “hibernation mode”. In brief, the PC has
the ability to temporarily shut itself down when there has been no user input via
either the keyboard or mouse for a set period of time. When shutting down, all data
which resides inside the current workspace (usually found in volatile stores such as
RAM and/or paged memory) is written off to the hard disk. Data which may exist in this
workspace is usually operating system data, programs such as Internet Explorer or
Borland C++, and the data that these programs rely on.
When the user once again resumes their place behind the keyboard and reactivates the
PC by wiggling the mouse, they are in fact unwittingly resurrecting the last execution
point from secondary storage.
Benefits of this approach are:
• Implementing this type of strategy is easy; everything in our memory area
becomes persistent, there is no choice. Thus there is no risk of discarding data
which is needed later on.
• The logical data structure is preserved, as the program data does not have to be
broken up into chunks, and its relationship with the program at that time is also
retained.
Drawbacks of this approach:
• There is no way of deciding what is saved.
• There is no way to share the data with more than one user. This is due to the
program code and program data being inseparable.
File Persistence
“Open” and “Save” functionality is the most traditional way to give an application the
ability to read or write the current state of its program data from or to secondary
storage.
Using file persistence we are provided with the means to separate program data from
program code.
When this data is “saved” it exists on secondary storage as a file. The structure of the
file held on disk will differ significantly from the logical data structure held in main
memory.
To reverse this process and resurrect the state at which the program data was “saved”,
we need to explicitly read all the information into memory and combine it again with
the program code. We do this using the Open mechanism.
Consider Java’s method for implementing file persistence. Each individual object’s
current state can be converted to a binary stream, which is written out to secondary
storage as a file. This process is called Object Serialization and is the easiest
method of generating saved data, as all the programmer needs to do is specify which
objects are to be made persistent by including the words “implements
java.io.Serializable” in the class declaration. Java then decides for itself what
information is needed to allow us to resurrect the original state. To recover the saved
state we specify the file name and use the object deserialization process to restore
the object state back to main memory.
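As a rough illustration of the Save and Open steps described above, the sketch below serializes one object to a temporary file and reads it back. The Student class, its fields and the SerializationSketch helper are hypothetical names invented for this example; only the java.io classes come from the standard library.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical class for illustration; marking it Serializable is all that is
// needed for Java to work out which state must be written.
class Student implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final int matricNumber;

    Student(String name, int matricNumber) {
        this.name = name;
        this.matricNumber = matricNumber;
    }
}

class SerializationSketch {
    // "Save": turn the object's state into a binary stream held in a file.
    static void save(Student student, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(student);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "Open": read the stream back and rebuild the object in main memory.
    static Student open(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Student) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Saves to a temporary file and opens it again, returning the restored copy.
    static Student roundTrip(Student student) {
        try {
            File file = File.createTempFile("student", ".ser");
            file.deleteOnExit();
            save(student, file);
            return open(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Student restored = roundTrip(new Student("Example Student", 12345));
        System.out.println(restored.name + " " + restored.matricNumber);
    }
}
```

The restored object is a new instance with the same field values, which is why the file can be handed to another user and opened there.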
Benefits of this approach are:
• The designer has the ability to nominate only the information which is
required to be stored. Unlike session persistence, which stores all of the
information in the workspace, file persistence can cut down on the data stored,
thus reducing the amount of storage space required.
• Since the data is in a file format it can be easily transferred from user to user.
Drawbacks of this approach:
• Every program must include Open and Save functionality. This makes the
system designer’s job harder, as they have to implement and test these extra
methods.
• We lose the simplicity of the single data model, as we now have three
concepts to deal with, each with a differing data structure. The internal program
model may contain lots of program-specific data which is not captured in the
external file format view. If we also take into consideration the designer’s
abstract view of how the objects and data should be structured, we have scope
for confusion as to what is represented and stored. To transform any one of
the three into another takes a great deal of processing and intellectual power.
Orthogonal Persistence
In orthogonally persistent systems the manner in which data is held on secondary
storage exactly matches the logical structure of the internal in-memory representation.
Any data which is found in an orthogonal system can be made persistent or transient.
Moreover, any class which is declared as persistent will automatically store all its
lower classes so as to maintain referential integrity.
When a table in a relational database is accessed, for instance, the in-memory
representation that is fetched and loaded into paged memory will exactly resemble
the contents of that section of the hard disk.
Benefits of this approach are:
• We have one single data model to follow.
• Rather than loading every piece of data from secondary storage at once, we only
need to fetch what is needed and hold pointer references to the rest. This saves
the processor from carrying out a lot of needless swapping of data in and out
of the main memory area.
• We can decide what data is to be saved.
Drawbacks of this approach are:
• We cannot share this information easily with other systems.
Distinguishing Transient and Persistent Data
Introduction
The implementation of persistence varies between databases. This is not because each
developer comes up with their own solution, but because each type of implementation
comes with its own merits and flaws.
The two main criteria are:
• How little effort it takes for the programmer to make objects persistent.
• Efficiency of the storage of data in the system.
A better solution for one of these criteria may result in a performance loss for the
other.
An object can be identified as persistent in one of two ways. It can be explicitly
marked as persistent by the programmer, or it can be referred to from a persistent
object. The latter is not implemented in all persistent stores, but lack of this
feature may lead to dangling references. These have to be dealt with by the database
management system, and are generally considered a bad thing since the data is no
longer complete.
All objects that are not in this set are transient and will not exist past the
execution of the current code.
For example, a persistent system may require the programmer to mark which objects
are to be persistent. This:
• Should minimise the storage requirements, by only storing objects that the
programmer has explicitly asked to be stored.
• Is trickier for the programmer, as they have to decide what to put into the
database.
• May lead to items not being correctly marked as persistent because of 'bad calls'
made by the programmer.
Another example is a system where the database management system decides what
is made persistent and what is not. This leads to:
• The programmer being abstracted from the low-level persistence system,
though there may still be a way for the programmer to indicate to the system
that a class or instance should be made persistent.
• The DBMS automatically upgrading objects to persistent. Note that there is no
reason why something would be changed back to transient after being made
persistent.
Marking Persistence
A database management system can distinguish transient and persistent data in
different ways, each of which has its own merits and flaws. Some database
management systems implement more than one, or allow the behavior of one
system to be emulated by using another.
There are seven main methods of marking persistence. Each of these is looked at in
turn.
Persistent classes
If this system is used, a class is declared persistent when it is created. These classes
are explicitly persistent and all objects created from them will be persistent. Objects
of other classes, not in this set, can only be made persistent by reachability; if this
isn’t supported, only objects whose classes are declared as persistent can ever be made
persistent. This approach is therefore not orthogonal to type, since persistence is
only granted to classes that have been assigned this persistent behavior. The
programmer may later have to convert a class if the model needs changing, adding to the
cost of development and reducing the flexibility of the system. Since a class is known
to be persistent, the memory for an object of that class can automatically be set aside
in the persistent store when that object is created.
If a class has been created by a third party then there may be no way to alter it, and
hence no way to change it to a persistent or transient type.
Persistent shadow classes
With Persistent Shadow Classes the system automatically makes two classes for each
class a programmer generates. The programmer can then create either a transient object
or a persistent object depending on which class they instantiate. The transient version
cannot be upgraded to a persistent version; however, it should be possible to copy a
transient object to a persistent object of the same type. Because it is known at
creation whether an object is persistent or not, memory in the persistent store can be
set aside for it, improving efficiency.
Unlike Persistent Classes, this ensures that all classes have a persistence-capable
version, which removes the problem of a third-party class not having persistent
capabilities.
Persistent Root Classes
This uses a root persistent class that must be inherited from to make a class
persistent. This would restrict the behavior of classes a great deal in hierarchical
(single inheritance) languages, as the class can only inherit from the root class or
one of its subclasses. It may be possible to get around this in languages like Java by
using interfaces such as java.io.Serializable. It works a lot better in languages based
on multiple inheritance; however, it may still lead to complications because of the
tricky way multiple inheritance tends to work.
The Object Database Management Group (ODMG) standard uses this for its C++ binding.
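The single-inheritance restriction, and the interface workaround mentioned above, can be sketched in Java as follows. Every name here (PersistentRoot, Persistable, LibraryWidget and so on) is hypothetical and chosen only for illustration; none of it is taken from the ODMG bindings.

```java
// A persistent root class: every persistent class must descend from it.
abstract class PersistentRoot { }

// Works, but Course has now used up its single superclass.
class Course extends PersistentRoot { }

// Stand-in for a third-party class that we cannot change.
class LibraryWidget { }

// A marker interface, in the style of java.io.Serializable.
interface Persistable { }

// A subclass can keep its third-party parent and still opt in to persistence.
class SavedWidget extends LibraryWidget implements Persistable { }

class RootClassSketch {
    // How a hypothetical store might decide whether it may persist an object.
    static boolean isPersistenceCapable(Object o) {
        return o instanceof PersistentRoot || o instanceof Persistable;
    }

    public static void main(String[] args) {
        System.out.println(isPersistenceCapable(new Course()));        // true
        System.out.println(isPersistenceCapable(new LibraryWidget())); // false
        System.out.println(isPersistenceCapable(new SavedWidget()));   // true
    }
}
```

With only the root class available, SavedWidget could not exist, because a Java class cannot extend both LibraryWidget and PersistentRoot.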
Persistence Declared at Object Creation
This provides a mechanism by which an object can be declared as persistent when it is
created. This is indicated by some keyword when the object is dynamically created. It
involves more work for the implementation programmer, as they have to decide during
coding which objects are to be elevated. It is a flexible system since any object can
be made persistent, and it is not based on class. Since the object is known to be
persistent when it is dynamically created, memory can be set aside in the persistent
store, so it remains reasonably efficient. The programmer still has to be careful not
to waste space by retaining values that are designed only to be used within a function,
especially if that function is called many times during execution.
Under the default model, an object still cannot be made persistent after it has been
created. It might be possible to assign a non-persistent object to a persistent one,
hence making it persistent, but this has a run-time cost.
Persistence by Explicit Storage
Persistence by Explicit Storage allows the programmer to promote an object to
persistent at run time by using a keyword. The object is then copied to the persistent
store, which has an overhead. This is one of the more flexible systems, as any
particular instance of a class can be made persistent and this can happen at run time.
It therefore supports upgrading objects to persistent at run time, and persistent
versions of objects should be comparable with their transient equivalents.
System-Provided Persistent Roots
System-Provided Persistent Roots provide a class that acts as a form of container in
which to place objects. Any object placed within this container is made persistent.
The ideal system should allow all objects and primitives to be added, so that anything
can be made persistent. The container should also be globally available, to ensure that
the programmer can make an object persistent anywhere within their code.
If the object is large or has many links then the process of copying it to the
persistent store will be a large overhead. This behavior implies that persistence by
reference must exist.
The Object Database Management Group uses this for its Java and Smalltalk bindings.
This is a better choice than Persistent Root Classes for languages such as Java, since
they do not support multiple inheritance and have a single, well-defined hierarchy
structure for their classes. In Java all classes are children of java.lang.Object,
which can therefore be used as the generic type of the objects the container holds.
Named Root Objects
By designating an object as a named root object it becomes a root of persistence,
much like the system-provided persistent roots. As with system-provided persistent
roots, this implies that persistence by reference must exist.
The benefit of this over system-provided persistent roots is that the programmer
gets to dictate what begins this process. The programmer 'flags' an object, which
promotes it to persistent; all other objects that are linked to this object can then
be promoted through persistence by reachability.
Persistence by Reachability
Persistence by reachability means that if an object is persistent, all objects that are
referred to by that object must also be made persistent, or the references would
become invalid.
Explicitly persistent objects therefore dictate what is made persistent. However, an
object that is being promoted to persistent may in turn have its own references that
must go through the same process, hence a cascading effect of promotion. Large objects
with lots of references will produce a significant overhead while being copied to the
persistent store. Programmers might be able to denote fields in objects that are never
to be made persistent (the transient keyword in Java), hence saving space in the
persistent store. This has the potential problem of not storing things that will be
required when the data is reloaded.
Persistence by reachability maintains referential integrity by ensuring that all
objects that are referred to are stored. It may also take away a significant portion
of the programmer’s workload, as they do not need to explicitly mark much of the
database as persistent.
It may however result in a loss of performance, as more data than is required may be
stored and retrieved by the database. This performance loss is preferable to having
the database refer to something that was transient and therefore no longer exists.
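The cascading promotion described above amounts to computing everything reachable from the roots of persistence. The following is a minimal sketch under simplifying assumptions: objects are modelled by a hypothetical Node class that lists its references, and ReachabilitySketch, bind and persistentObjects are names made up for this example rather than any real OODBMS API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical object model: each object simply lists the objects it refers to.
class Node {
    final String label;
    final List<Node> references = new ArrayList<>();

    Node(String label) {
        this.label = label;
    }
}

class ReachabilitySketch {
    // Named root objects: the only objects the programmer flags explicitly.
    private final Map<String, Node> namedRoots = new LinkedHashMap<>();

    void bind(String name, Node root) {
        namedRoots.put(name, root);
    }

    // Everything reachable from a root is promoted to persistent.
    Set<Node> persistentObjects() {
        Set<Node> persistent = new LinkedHashSet<>();
        Deque<Node> pending = new ArrayDeque<>(namedRoots.values());
        while (!pending.isEmpty()) {
            Node current = pending.pop();
            // add() returns false for objects already promoted, so cycles terminate.
            if (persistent.add(current)) {
                pending.addAll(current.references);
            }
        }
        return persistent;
    }

    public static void main(String[] args) {
        Node courses = new Node("courses");
        Node course = new Node("course");
        Node student = new Node("student");
        Node scratch = new Node("scratch"); // never referenced, so it stays transient

        courses.references.add(course);
        course.references.add(student);
        student.references.add(course); // a cycle between course and student

        ReachabilitySketch store = new ReachabilitySketch();
        store.bind("courses", courses);
        for (Node n : store.persistentObjects()) {
            System.out.println(n.label);
        }
    }
}
```

Only the "courses" object is flagged by the programmer; the course and student objects are promoted because they can be reached from it, while the unreferenced scratch object is left out.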
Storing Code and Data
The importance of storing code and data together is highlighted by the kinds of
application that require the use of databases, as they hold very complex information.
The diagram below shows four kinds of file system; code is shown as ovals, while data
is shown as rectangles:
Figure 1
System (a) in figure 1 is an unprotected file system however it has a flexible
environment in which code and data can be freely mixed.
System (b) in figure 1 is a traditional DBMS that controls part of the file system,
which it uses to store and protect the data. Application code remains outside the
control of the DBMS.
System (c) in figure 1 is an orthogonally persistent language, which allows code to be
protected as well and for data and code to be freely mixed.
System (d) in figure 1 is an OODBMS file system that provides a class structure in
which the mixture of code and data is controlled.
Systems which exclude the program code from the structure of the database can be
expected to increasingly fall into disuse, because they make the creation and
maintenance of applications exceptionally costly and error-prone.
For instance, programming languages with file persistence store programs in one set of
files and data in another set of files; the data and program files are controlled using
the directory structure of the file system. As this method provides no integrity
services, accessing data files correctly is the responsibility of the application
program and of software development environment tools, for example the Unix tool SCCS.
Programming languages with orthogonal persistence deal with this by removing any
distinction between code and data. As code is treated in the same manner as normal
data, it can be stored in and retrieved from the database.
As traditional database systems impose a strict structure on the data and deal with
its organization, they are a great deal more organised with regard to data. The problem
with this approach is that although data is kept very consistent, there is no way of
keeping application code consistent, either with itself or with the data, as no
structure is provided to keep the code and data together.
Removing Data
An implicit capability which database systems must provide is the facility to dispose
of data which is no longer of use to the system. To enable unnecessary data to be
discarded there must be a way in which the system can identify the data that is to be
removed. The method used to identify unnecessary data is similar to the method used to
identify persistent data: redundant data can either be explicitly removed by the
programmer, or it can be deleted automatically by the system once the data becomes
unusable, i.e. once it can no longer be reached via the roots of persistence. Data
management systems can therefore provide explicit delete operations (e.g. a C++
destructor), or data can be removed via garbage collection. Some systems use a blend of
these two approaches.
Explicit Deletion of Data – Systems can allow explicit deletion of objects and
classes. Some systems allow this because it is believed that the programmer can achieve
the most efficient storage of data. This method can avoid integrity violations; for
example, in C++ an operation called a destructor can be used to explicitly remove data
without causing a deletion violation.

Garbage Collection – Systems can also provide an automatic method of removing data
called garbage collection. A popular use of garbage collection is in an OODBMS which
uses persistence through reachability. As an example, consider a course being cancelled
at Napier. All students on this course can be relocated to other courses or leave (in
this scenario we will ignore the students that leave), and the course attribute of each
student object will be changed to the new course. Once all objects representing the
members of the course are updated, there will be no references left from the student
objects to the old course object. It can then be removed from the object holding the
set of courses and, as there will be no remaining references to it, it becomes unusable
data. The diagram below shows an object graph with garbage and non-garbage objects:
Figure 2
When garbage collection occurs there are two stages which must be performed. The first
is identification of the garbage data (mark): starting at the persistent root objects,
the collector marks everything referred to by the roots as required data, and then
recursively marks further objects referred to by already marked objects. Eventually no
more objects will be marked, and anything left unmarked is garbage. At this point the
second stage starts: removal of the unwanted objects (sweep), so that the space is
freed and returned to the free pool for future reallocation. These stages are usually
kept separate as it is simpler, and the method is usually called mark-and-sweep. The
problem with the mark-and-sweep method of garbage collection is that it has an
unfortunate tendency to fragment the memory (heap). This occurs when a long-running
program has undergone garbage collection several times. Objects become spread out in
the heap, separated by small vacant memory regions, and memory allocation for a new
object may then fail because, although there is ample vacant memory, it is not in one
contiguous block.
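The two stages can be sketched as follows, reusing the cancelled-course scenario. This is a toy model rather than a real collector: the heap is a plain list, and HeapObject, MarkAndSweepSketch and their members are hypothetical names introduced for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical heap object: a name, outgoing references and a mark bit.
class HeapObject {
    final String name;
    final List<HeapObject> refs = new ArrayList<>();
    boolean marked;

    HeapObject(String name) {
        this.name = name;
    }
}

class MarkAndSweepSketch {
    // Mark stage: starting from the roots, mark everything that can be reached.
    static void mark(Collection<HeapObject> roots) {
        Deque<HeapObject> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            HeapObject o = pending.pop();
            if (!o.marked) {
                o.marked = true;
                pending.addAll(o.refs);
            }
        }
    }

    // Sweep stage: remove unmarked objects and clear the marks for the next cycle.
    static List<String> sweep(List<HeapObject> heap) {
        List<String> freed = new ArrayList<>();
        Iterator<HeapObject> it = heap.iterator();
        while (it.hasNext()) {
            HeapObject o = it.next();
            if (o.marked) {
                o.marked = false;
            } else {
                freed.add(o.name);
                it.remove();
            }
        }
        return freed;
    }

    static List<String> collect(List<HeapObject> heap, Collection<HeapObject> roots) {
        mark(roots);
        return sweep(heap);
    }

    public static void main(String[] args) {
        HeapObject courseSet = new HeapObject("courseSet");
        HeapObject newCourse = new HeapObject("newCourse");
        HeapObject oldCourse = new HeapObject("oldCourse"); // cancelled, no longer referenced
        HeapObject student = new HeapObject("student");
        courseSet.refs.add(newCourse);
        student.refs.add(newCourse);

        List<HeapObject> heap =
                new ArrayList<>(Arrays.asList(courseSet, newCourse, oldCourse, student));
        System.out.println(collect(heap, Arrays.asList(courseSet, student))); // [oldCourse]
    }
}
```

The sketch frees objects in place and never moves the survivors, which is exactly why repeated cycles leave gaps between live objects in a real heap.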
Another method of garbage collection is called stop-and-copy [3] (Microsoft use this
algorithm for Java garbage collection). This method both collects garbage and
de-fragments the heap. The memory is divided into two regions, an active region and an
inactive region, which switch roles with every cycle of the method. At any moment in
time all dynamically allocated live object instances reside in the active region, and
the inactive region is empty. The stop aspect of the method begins when the memory in
the active region is exhausted. The copy aspect then copies all of the live objects
from the active region to the inactive region; as each object is copied, its references
are updated to point to the new locations. The final step is to switch the active and
inactive regions. As this method only copies the live objects, any objects left behind
are garbage and are deleted all at once when the active region becomes inactive. The
live objects end up stored in contiguous memory locations, so the heap is automatically
de-fragmented. The problem with this method is the cost of doubling the size of the
heap (i.e. active and inactive regions). An example of this method is shown in
figure 3.
Figure 3
Explanation of figure 3:
Stage 1: The lower half of memory is the inactive region and the upper, active region
is partly used.
Stage 2: The active region is gradually being filled with objects.
Stage 3: The active region is full; at this point the collector stops the program.
Stage 4: The live objects in the active region (upper) are copied to the inactive region
(lower).
Stage 5: This is the state the memory is in when the process is complete.
Stage 6: The process now restarts with the lower region as the active region, which is
gradually being filled with objects, and the upper region as the inactive, unused region.
Stage 7: The active region is full again; at this point the collector stops the program.
Stage 8: The live objects in the active region (lower) are copied to the inactive region
(upper).
Stage 9: This is the state the memory is in when the process is completed. This process
can occur several times.
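The stages above can be imitated with two arrays standing in for the active and inactive regions. This is a simplified sketch, not the collector used by any particular virtual machine: references are array indices, "copying" moves the cell object between arrays, and Cell, StopAndCopySketch, allocate and collect are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical heap cell; refs holds the indices of the cells it refers to.
class Cell {
    final String name;
    int[] refs;

    Cell(String name, int... refs) {
        this.name = name;
        this.refs = refs;
    }
}

class StopAndCopySketch {
    Cell[] active;
    Cell[] inactive;
    int used; // number of occupied slots in the active region

    private int[] forward; // old index -> new index, or -1 if not copied yet
    private int next;      // next free slot in the inactive region

    StopAndCopySketch(int regionSize) {
        active = new Cell[regionSize];
        inactive = new Cell[regionSize];
    }

    int allocate(String name, int... refs) {
        if (used == active.length) {
            throw new IllegalStateException("active region full: collect first");
        }
        active[used] = new Cell(name, refs);
        return used++;
    }

    // Copies one cell to the inactive region (once) and returns its new index.
    private int copy(int oldIndex) {
        if (forward[oldIndex] == -1) {
            forward[oldIndex] = next;
            inactive[next++] = active[oldIndex];
        }
        return forward[oldIndex];
    }

    // Copies every live cell, rewrites references, then swaps the two regions.
    // Returns the new indices of the roots.
    int[] collect(int[] roots) {
        forward = new int[active.length];
        Arrays.fill(forward, -1);
        next = 0;

        int[] newRoots = new int[roots.length];
        for (int i = 0; i < roots.length; i++) {
            newRoots[i] = copy(roots[i]);
        }
        // Scan the copied cells in order; copying their targets extends the scan.
        for (int scan = 0; scan < next; scan++) {
            Cell cell = inactive[scan];
            int[] updated = new int[cell.refs.length];
            for (int j = 0; j < updated.length; j++) {
                updated[j] = copy(cell.refs[j]);
            }
            cell.refs = updated;
        }

        Cell[] old = active;
        active = inactive;
        inactive = old;
        Arrays.fill(inactive, null); // everything left behind is discarded at once
        used = next;
        return newRoots;
    }

    public static void main(String[] args) {
        StopAndCopySketch heap = new StopAndCopySketch(4);
        int b = heap.allocate("b");
        heap.allocate("garbage");
        int a = heap.allocate("a", b);

        int[] roots = heap.collect(new int[] { a });
        System.out.println(heap.used);                   // 2
        System.out.println(heap.active[roots[0]].name);  // a
    }
}
```

After the collection the two live cells sit next to each other at the start of the new active region, and the garbage cell has disappeared without ever being visited.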
A further method of garbage collection is reference counting. Every object keeps a
count of all references made to it: every time a new reference is made this total is
incremented, and every time a reference is removed the total is decremented. If the
total is zero the object is unreachable and so is garbage; a garbage collector can find
objects with a reference count of zero and reclaim the space they use. There are two
main problems with this method: the counts, and the work of maintaining them, add
overhead, and there is no guarantee that all unreachable objects will have a reference
count of zero, since objects that refer to each other in a cycle keep each other's
counts above zero. Thus this method is not sufficient on its own and is often used
together with the mark-and-sweep method.
An example of this method is shown in figure 4.
Figure 4
Explanation of figure 4:
Suppose the reference count of an object [i] has been decremented by 1. The system
checks whether [i] has any references left. If [i] has no more references, regular
reference counting takes over. If [i] still has some references, the system checks
whether the remaining references are cyclic, and deletes objects that receive only
cyclic references. This involves a container which stores objects together with an
internal reference count ‘I’, the number of references each receives from objects
within the container. The process starts from object [i] and finds all objects
reachable from it which are not root objects; each sub-object is added to the container
with an internal reference count of 1, as it has a reference from object [i]. The
internal reference count for each object in the container is then compared to its
reference count in the database. If an object has a reference from at least one object
outwith the cycle, it is marked as a live object, as are all objects reachable from it.
If the start object [i] is not marked as live, all references to it are cyclic; they
are de-referenced and the objects in the cycle are removed. This combination method is
used in MOOD: Material's Object-Oriented Database system.
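Plain reference counting, and the reason cycles defeat it, can be shown with a small sketch. It covers only the basic scheme, not the cycle-detecting variant used in MOOD, and Counted, RefCountSketch, retain, release and addRef are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical object carrying its own reference count.
class Counted {
    final String name;
    int refCount;
    final List<Counted> refs = new ArrayList<>();

    Counted(String name) {
        this.name = name;
    }
}

class RefCountSketch {
    final List<String> reclaimed = new ArrayList<>();

    // A new reference from outside the object graph (e.g. a root or a variable).
    void retain(Counted o) {
        o.refCount++;
    }

    // A new reference from one object to another.
    void addRef(Counted from, Counted to) {
        from.refs.add(to);
        to.refCount++;
    }

    // A reference is removed; at zero the object is reclaimed and lets go of its targets.
    void release(Counted o) {
        if (--o.refCount == 0) {
            reclaimed.add(o.name);
            for (Counted target : new ArrayList<>(o.refs)) {
                release(target);
            }
            o.refs.clear();
        }
    }

    public static void main(String[] args) {
        RefCountSketch rc = new RefCountSketch();

        // Acyclic case: dropping the only reference to a reclaims a and then b.
        Counted a = new Counted("a");
        Counted b = new Counted("b");
        rc.retain(a);
        rc.addRef(a, b);
        rc.release(a);
        System.out.println(rc.reclaimed); // [a, b]

        // Cyclic case: x and y refer to each other, so neither count reaches zero.
        Counted x = new Counted("x");
        Counted y = new Counted("y");
        rc.retain(x);
        rc.addRef(x, y);
        rc.addRef(y, x);
        rc.release(x);
        System.out.println(x.refCount + " " + y.refCount); // 1 1
    }
}
```

In the second case x and y are unreachable from outside but are never reclaimed, which is the gap that a supplementary pass such as mark-and-sweep or the cycle check described above is meant to close.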
An additional method is called mark-and-compact. This method again consists of two
phases. The first phase, mark, starts at the persistent root objects: it marks
everything referred to by the roots as required data, and then recursively marks
further objects referred to by already marked objects. The second phase, compaction,
collects anything that remains unmarked as garbage and compacts the memory by moving
all the live objects into contiguous memory locations. This method eliminates
fragmentation and does not incur the cost of additional space.
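A compacting collector can be sketched in the same toy style, with the heap modelled as a single array and references stored as indices. Slot, MarkCompactSketch and compact are hypothetical names, and the sliding strategy shown is just one simple way to compact.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical heap slot; refs holds the indices of the slots it refers to.
class Slot {
    final String name;
    final int[] refs;
    boolean marked;

    Slot(String name, int... refs) {
        this.name = name;
        this.refs = refs;
    }
}

class MarkCompactSketch {
    // Marks live slots, slides them to the front of the heap and rewrites all
    // indices (including the roots array). Returns the number of live slots.
    static int compact(Slot[] heap, int[] roots) {
        // Phase 1 (mark): everything reachable from the roots is live.
        Deque<Integer> pending = new ArrayDeque<>();
        for (int root : roots) {
            pending.push(root);
        }
        while (!pending.isEmpty()) {
            Slot s = heap[pending.pop()];
            if (s != null && !s.marked) {
                s.marked = true;
                for (int ref : s.refs) {
                    pending.push(ref);
                }
            }
        }

        // Phase 2a: work out where each live slot will end up.
        int[] forward = new int[heap.length];
        int live = 0;
        for (int i = 0; i < heap.length; i++) {
            forward[i] = (heap[i] != null && heap[i].marked) ? live++ : -1;
        }

        // Phase 2b: rewrite references and slide live slots down over the gaps.
        for (int i = 0; i < heap.length; i++) {
            Slot s = heap[i];
            heap[i] = null;
            if (s != null && s.marked) {
                for (int j = 0; j < s.refs.length; j++) {
                    s.refs[j] = forward[s.refs[j]];
                }
                s.marked = false;
                heap[forward[i]] = s; // forward[i] <= i, so nothing live is overwritten
            }
        }
        for (int i = 0; i < roots.length; i++) {
            roots[i] = forward[roots[i]];
        }
        return live;
    }

    public static void main(String[] args) {
        Slot[] heap = { new Slot("a", 3), new Slot("g1"), new Slot("g2"), new Slot("b"), null };
        int[] roots = { 0 };
        System.out.println(compact(heap, roots)); // 2
        System.out.println(heap[1].name);         // b
    }
}
```

Unlike the two-region approach, the live objects are compacted within the same array, so no second region has to be reserved.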
References
[1] ODMG 2.0: A Standard for Object Storage, Doug Barry, Component Strategies, July 1998.
[2] http://www.odmg.org/library/readingroom/Article - Components Strategies July98.html
[3] Garbage Collection In Java, Vern Martin, December 2, 1997:
[4] A presentation on mark-and-compact: http://mood.mech.tohoku.ac.jp/impl/RefCount.html
[Ref: Working with Persistent Objects – J. Eliot B. Moss]
[Ref: Atkinson 83 - An approach to persistent programming].
http://www.javacaps.com/java_serial.html
http://java.sun.com/j2se/1.4/docs/api/java/io/Serializable.html
[Ref: Databases: From Relational to Object-Oriented Systems – Claude Delobel]
[Ref: Object Databases an ODMG approach – Richard Cooper]
Figure Ref for “Persistence by Reachability” section and onward
figure 1: OODB book
figure 2: OODB book
figure 3: Source: Chapter 9 of Inside the Java 2 Virtual Machine, Garbage Collection, by Bill Venners, ISBN: 0-07-135093-4, McGraw-Hill.
figure 4: MOOD: Material's Object-Oriented Database: