Object Oriented Database (Group Write-up)

Course: CO42009 BEng Hons Computing
Matric Number: 96061707
E-mail: 96061707@napier.ac.uk
Date: 28/04/02
Authors: Barry Myles, Stephanie Dunsire, Gary Stewart
Presentation Title: Persistence in OODBs
Napier University

Introduction to Persistence

“A persistent programming language (PPL) is a programming language that includes a persistent memory area (e.g., a heap of objects) that outlives the execution of any individual program.” [Ref: Working with Persistent Objects - J. Eliot B. Moss]

To understand what persistence means we must consider the term in two ways. Persistence, when related to an information system (such as a database or application), is the ability of the program to store its current state on non-volatile storage. This allows the program's state to survive beyond the point at which it was last run. For instance, any data held in RAM will not persist past the stage at which the power is cut, as this type of memory is volatile; we therefore describe data in this area as transient data. However, any piece of data held on secondary storage (for instance a hard drive) can reasonably be expected to still be present and retrievable after the system is turned off and back on.

The second sense in which the term persistence is used is to describe the length of time a single piece of data exists. Consider a typical application written in a block-structured programming language such as C++ or COBOL. We can confidently say that such a program will contain several kinds of looping structure. Consider the contents of the “for” loop shown in Fig 1, paying special attention to the variable declared as “k”. When the loop is constructed, “k” is initialised. Every time the loop iterates, “k” is incremented by one. When the break condition is met and the loop terminates, should we still retain the value of “k”? Sensibly, the answer would be no!
If we are to save the system from performance degradation due to space being occupied by insignificant data, we should flag the memory reference or location for garbage collection. From creation to deletion, this kind of processing can be over in a matter of nanoseconds, meaning that the datum in question has persisted for a very short period of time. Conversely, consider the regulations recently imposed on ISPs (Internet Service Providers), whereby they are now duty-bound to log and retain data pertaining to the movements of their customers on the WWW (World Wide Web). This is an example of data persisting for a considerably longer period of time.

Fig 1

    for (int k = 0; k < theActions.length; k++) {
        if (theActions[k] == null) {
            panelPopup.addSeparator();
        } else {
            panelPopup.add(theActions[k]);
        }
    }

When coupling the programming power of an object-oriented language like Java or C++ with the data storage and manipulation mechanisms of a relational database, difficulties arise! These two components have their own very distinctive characteristics when it comes to the storage and orchestration of data. Programming languages in general are more suited to the manipulation of transient data; relational databases, on the other hand, tend toward the manipulation of persistent data. “This requirement is evident from a database point of view, but a novelty from a programming language point of view.” [Ref: Atkinson 83 - An approach to persistent programming]

It is clear, then, that to successfully build a package with all the power of these two fundamental elements we would be unwise to neglect either transient or persistent data; we must find a good mix of the two. Another issue which arises in this collaboration is the difference in modelling paradigms: how do we map objects in an OO model to the tables in a relational database? Both of these problems lend themselves to the term known as impedance mismatch.
There are ways round this, although they are beyond the scope of this presentation. So far we have talked about what persistence is and some of the problems with it; now let us move on to describe the mechanics of how a program writes persistent data to secondary storage. There are three types of persistence, each with its own methods for dealing with the data presented.

Kinds of Persistence

The three types of persistence are:

Session Persistence

This method saves the current state of execution by taking all the data in use and dumping the lot onto secondary storage. Take for example the Windows 98 operating environment, where a type of session persistence is used in the facility known as “hibernation mode”. As a brief overview: the PC has the ability to temporarily shut itself down when there has been no user input via either the keyboard or mouse for a set period of time. When shutting down, all data residing in the current workspace (usually found in volatile stores such as RAM and/or paged memory) is written out to the hard disk. The data in this workspace is usually operating system data, programs such as Internet Explorer or Borland C++, and the data those programs rely on. When the user once again resumes their place behind the keyboard and reactivates the PC by wiggling the mouse, they are in fact unwittingly resurrecting the last execution point from secondary storage.

Benefits of this approach: Implementing this type of strategy is easy; everything in our memory area becomes persistent, with no choices to make. There is therefore no risk of discarding data that is needed later on. The logical data structure is preserved, since the program data is never broken up into chunks, and its relationship with the program at that time is also retained.

Drawbacks of this approach: There is no way of deciding what is saved. There is no way to share the data with more than one user, because the program code and program data are inseparable.
File Persistence

“Open” and “Save” functionality is the most traditional way to give an application the ability to read or write the current state of its program data from or to secondary storage. File persistence provides the means to separate program data from program code. When this data is “saved” it exists on secondary storage as a file. The structure of the file held on disk may differ significantly from the logical data structure held in main memory. To reverse this process and resurrect the state at which the data was “saved”, we must explicitly read all the information back into memory and combine it again with the program code. We do this using the Open mechanism.

Consider Java's method for implementing file persistence. Each individual object's current state can be converted to a binary stream, which is pumped out to secondary storage in file format. This process is called Object Serialization, and it is the easiest method of generating saved data: all the programmer needs to do is specify which objects are to be made persistent by including the words “implements java.io.Serializable” in the class declaration. Java then decides for itself what information is needed to allow us to resurrect the original state. To recover the saved state we specify the file name and use the object deserialization process to restore the object state back to main memory.

Benefits of this approach: The designer has the ability to nominate only the information that is required to be stored. Unlike session persistence, which stores all of the information in the workspace, file persistence can cut down on the data stored, thus reducing the amount of storage space required. Since the data is in a file format, it can easily be transferred from user to user.

Drawbacks of this approach: Every program must include Open and Save functionality. This makes the system designer's job harder, as they have to implement and test these extra methods.
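As a concrete illustration of the serialization mechanism just described, the sketch below round-trips an object through a file. The Student class, its fields, and the file name are our own illustrative choices, not part of any particular system.

```java
import java.io.*;

// A class becomes eligible for Java file persistence simply by
// implementing the marker interface java.io.Serializable.
class Student implements Serializable {
    String name;
    int matric;
    Student(String name, int matric) { this.name = name; this.matric = matric; }
}

public class SerializeDemo {
    // "Save" then "Open": write the object's state out as a binary
    // stream, then deserialize it to resurrect the original state.
    static Student roundTrip(Student s) {
        try {
            File f = File.createTempFile("student", ".ser");
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(f))) {
                out.writeObject(s);                  // Save
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(f))) {
                return (Student) in.readObject();    // Open
            } finally {
                f.delete();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Student restored = roundTrip(new Student("Barry", 96061707));
        System.out.println(restored.name + " " + restored.matric);
    }
}
```

Note that only the object's data travels through the file; the class definition itself must still be present when the state is read back in.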
We also lose the simplicity of a single data model, as we now have three representations to deal with, each with a differing data structure. The internal program model may contain lots of program-specific data which is not captured in the external file format view. If we also take into consideration the designer's abstract view of how the objects and data should be structured, we have scope for confusion as to what is represented and stored. Transforming any one of the three into another takes a great deal of processing and intellectual power.

Orthogonal Persistence

In orthogonally persistent systems, the manner in which data is held on secondary storage exactly matches the logical structure of the internal in-memory representation. Any data found in an orthogonal system can be made persistent or transient. Moreover, any class which is declared as persistent will automatically have all its lower classes stored as well, so as to maintain referential integrity. When a table in a relational database is accessed, for instance, the in-memory representation fetched and loaded into paged memory will exactly resemble the contents of that section on the hard disk.

Benefits of this approach: We have one single data model to follow. To avoid loading every piece of data from secondary storage at once, we need only fetch what is needed and hold pointer references to the rest; this saves the processor from carrying out a lot of needless swapping of data in and out of the main memory area. We can decide what data is to be saved.

Drawbacks of this approach: We cannot share this information easily with other systems.

Distinguishing Transient and Persistent Data

Introduction

The implementation of persistence varies between databases. This is not because each development comes up with its own solution, but because each type of implementation comes with its own merits and flaws.
The two main criteria are: how little effort it takes for the programmer to make objects persistent, and the efficiency of the storage of data in the system. A better solution against one of these criteria may result in a performance loss against the second.

An object can be identified as persistent in one of two ways: it can be explicitly marked as persistent by the programmer, or it can be referred to from a persistent object. The latter is not implemented in all persistent stores, but the lack of this feature may lead to dangling references. These have to be dealt with by the database management system, and dangling references are generally considered a bad thing, since the data is no longer complete. All objects not in the persistent set are transient and will not exist past the execution of the current code.

For example, a persistent system may require the programmer to mark which objects are to be persistent. This: should minimise the storage requirements, by only storing objects that the programmer has explicitly told the system to store; is trickier for the programmer, as they have to decide what to put into the database; and may lead to items not being correctly made persistent because of 'bad calls' made by the programmer.

Another example is a system where the database management system decides what is made persistent and what is not. This leads to the programmer being abstracted from the low-level persistence system, though there may be an ability for the programmer to indicate to the system that a class or instance should be made persistent. The DBMS will automatically upgrade something to persistent; note that there is no reason why something would be changed back to transient after being made persistent.

Marking Persistence

A database management system can distinguish transient and persistent data in different ways, because each method has its own merits and flaws.
Some database management systems will implement more than one, or allow the behaviour of one system to be emulated by using another. There are seven main methods of marking persistence; each of these will be looked at in turn.

Persistent Classes

Under this system a class is declared persistent when it is created. These classes are explicitly persistent, and all objects of these classes will be persistent. Other classes, not in this set, can only be made persistent by reachability; if this is not supported, only objects whose classes are declared as persistent can ever be made persistent. This is obviously not orthogonal, since persistence is only granted to classes that have been assigned this persistent behaviour. The programmer may later have to convert a class if the model needs changing, adding to the cost of development and reducing the flexibility of the system. Since a class is known to be persistent, the memory for an object of that class can automatically be set aside in the persistent store when that object is created. If a class has been created by a third party, there may be no way to alter it and hence change it to a persistent or transient type.

Persistent Shadow Classes

Persistent Shadow Classes automatically create two classes for each class a programmer generates. The programmer can then create either a transient object or a persistent object, depending on which class they call. The transient version cannot be upgraded to a persistent version; however, it should be possible to copy a transient object to a persistent object of the same type. Because it is known at creation whether an object is persistent or not, memory in the persistent store can be set aside for it, improving efficiency. Unlike Persistent Classes, this ensures that all classes have a persistence-capable version, which removes the problem of a third-party development not having persistent capabilities.
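The shadow-class idea can be sketched as below. In a real system the pair would be generated automatically rather than written by hand; the class names and the inStore flag are purely illustrative.

```java
// Transient version, as written by the programmer.
class Course {
    String title;
    Course(String title) { this.title = title; }
}

// Hypothetical system-generated persistent shadow of Course. A
// transient Course cannot be upgraded in place, but it can be
// copied into a persistent object of the same shape.
class PersistentCourse {
    String title;
    boolean inStore = true;   // stand-in for residing in the persistent store

    PersistentCourse(String title) { this.title = title; }

    static PersistentCourse copyOf(Course c) {
        return new PersistentCourse(c.title);
    }
}
```

Because the choice between the two classes is made at creation time, the system knows immediately which store to allocate the object in.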
Persistent Root Classes

This uses a root persistent class that must be inherited from in order to make a class persistent. It restricts the behaviour of classes a great deal in hierarchical (single-inheritance) languages, as a class can then only inherit from the root class or one of its subclasses. It may be possible to get around this in languages like Java by using interfaces such as java.io.Serializable. It works much better in multiple-inheritance languages, though it may still lead to complications arising from the tricky way multiple inheritance tends to work. The Object Database Management Group's standard uses this approach for the C++ binding in its system.

Persistence Declared at Object Creation

This provides a mechanism by which an object can be declared as persistent when it is created, noted by some keyword when the object is dynamically created. This involves more work for the implementation programmer, as they will have to decide during their coding which objects are to be elevated. It is a flexible system, since any object can be made persistent and the decision is not based on class. Since the object is known to be persistent when it is dynamically created, the memory can be set aside in the persistent store, so it still remains reasonably efficient. The programmer still has to be careful not to waste space by retaining values that are designed only to be used within a function, especially if that function is called many times during execution. Under the default model the object still cannot be made persistent after it has been declared, though it might be possible to assign a non-persistent object to a persistent one, hence making it persistent, at a run-time cost.

Persistence by Explicit Storage

Persistence by Explicit Storage allows the programmer to promote an object to persistent at run time by using a keyword. The object is then copied to the persistent store, which obviously has an overhead.
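Java has no such keyword, so the sketch below stands in a store() call for it; the Store class and its list-backed store are hypothetical scaffolding, not a real API.

```java
// Sketch of persistence by explicit storage: any existing, transient
// object can be promoted to persistent at run time.
class Store {
    static final java.util.List<Object> persisted = new java.util.ArrayList<>();

    // Promote an object; the copy into the store is where the
    // run-time overhead of this approach lies.
    static void persist(Object o) { persisted.add(o); }

    static boolean isPersistent(Object o) { return persisted.contains(o); }
}
```

The promotion is per-instance, not per-class, which is what makes this one of the more flexible schemes.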
This is one of the more flexible systems, as any particular instance of a class can be made persistent, and this can happen at run time. It therefore supports upgrading objects to persistent at run time, and persistent versions of objects should be compatible with their transient equivalents.

System-Provided Persistent Roots

System-Provided Persistent Roots provide a class that acts as a form of container in which to place objects. Any object placed within this container is made persistent. The ideal system should allow all objects and primitives to be added, so that everything can potentially be made persistent. It should also be globally available, to ensure that the programmer can make an object persistent anywhere within their code. If an object is large or has many links, the process of copying it to the persistent store will be a large overhead. This behaviour implies that persistence by reachability must exist. The Object Database Management Group uses this approach for its Java and Smalltalk bindings. It is a better choice than Persistent Root Classes for languages such as Java, which do not support multiple inheritance and which have an easily placeable hierarchy structure for their classes: in Java all classes are children of java.lang.Object, which can therefore be used as the generic type for the container.

Named Root Objects

By assigning something as a named root object, it becomes a root of persistence, much like the system-provided persistent roots. Like system-provided persistent roots, this implies that persistence by reachability must exist. The benefit of this over system-provided persistent roots is that the programmer gets to dictate what begins the process: the programmer 'flags' an object, which promotes it to persistent, and then all other objects linked to this object can be promoted through persistence by reachability.
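The promotion cascade that a flagged root triggers can be sketched as follows. This is a minimal model: the Node class and its boolean flag stand in for a real persistent store, and the guard that stops re-promotion also makes the cascade safe for cyclic graphs.

```java
import java.util.*;

// Minimal model of persistence by reachability: promoting one object
// transitively promotes everything reachable from it.
class Node {
    final String name;
    final List<Node> refs = new ArrayList<>();
    boolean persistent = false;

    Node(String name) { this.name = name; }

    void makePersistent() {
        if (persistent) return;                  // already promoted; also stops cycles
        persistent = true;                       // stand-in for copying to the store
        for (Node r : refs) r.makePersistent();  // cascade through the references
    }
}
```

Flagging one node therefore promotes its whole reachable sub-graph, while unrelated objects stay transient.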
Persistence by Reachability

Persistence by reachability means that if an object is persistent, all objects referred to by that object must also be made persistent, or the references would become invalid. Explicitly persistent objects therefore dictate what is made persistent; however, an object being promoted to persistent may in turn have its own references, which must go through the same process, hence a cascading effect of promotion. Large objects with many references will incur a significant overhead while being copied to the persistent store. Programmers may be able to denote fields in objects that are never to be made persistent (the transient keyword in Java), saving space in the persistent store; this has the potential problem of not storing things that will be required when the code is reloaded.

Persistence by reachability maintains referential integrity by ensuring that all objects that are referred to are stored. It may also take away a significant portion of the programmer's workload, as they do not need to apply explicit persistence to much of the database. It may, however, result in a loss of performance, as more data than is required may be stored and retrieved by the database. This performance loss is better than the consequence of the database referring to something that was transient and therefore no longer exists.

Storing Code and Data

The importance of storing code and data together is highlighted by the kinds of application that require the use of databases, as they hold very complex information. The diagram below shows four file systems; code is shown as ovals, while data is shown as rectangles:

Figure 1

System (a) in figure 1 is an unprotected file system; however, it has a flexible environment in which code and data can be freely mixed. System (b) in figure 1 is a traditional DBMS that controls part of the file system, which it uses to store and protect the data.
Application code remains outside the control of the DBMS. System (c) in figure 1 is an orthogonally persistent language, which allows code to be protected as well, and allows data and code to be freely mixed. System (d) in figure 1 is an OODBMS file system that provides a class structure in which the mixture of code and data is controlled.

Systems which exclude the program code from the structure of the database can be expected to fall increasingly into disuse, because they make the creation and maintenance of applications exceptionally costly and error-prone. For instance, programming languages with file persistence store programs in one set of files and data in another set of files; the data and program files are controlled using the directory structure of the file system. As this method provides no integrity services, accessing data files logically is the responsibility of the application program and of the software development environment tools, for example the Unix tool SCCS. Programming languages with orthogonal persistence deal with this by removing any distinction between code and data: as code is treated in the same manner as normal data, it can be stored in and retrieved from the database. Traditional database systems, which impose a strict structure on the data and deal with its organisation, are a great deal more organised with regard to data. The problem with this approach is that, although the data is kept very consistent, there is no way of keeping application code consistent, either internally or with the data, as no structure is provided to keep the code and data together.

Removing Data

An implicit capability which database systems must provide is the facility to dispose of data that is no longer of use to the system. To enable unnecessary data to be discarded, there must be a way in which the system can categorise data as ready to be removed.
The method used to identify unnecessary data is similar to the method used to categorise persistent data: redundant data can either be explicitly removed by the programmer, or it can be deleted mechanically by the system once the data becomes unusable, i.e. when the data can no longer be reached via the roots of persistence. Thus data management systems can provide explicit delete operations (e.g. a C++ destructor), or data can be removed via the use of garbage collection. Some systems use a blend of these two approaches.

Explicit Deletion of Data - Systems can allow explicit deletion of objects and classes. Some systems allow this because it is believed that the programmer can achieve the most efficient storage of data. This method can avoid integrity violations; for example, in C++ an operation called a destructor can be used to explicitly remove data without causing a deletion violation.

Garbage Collection - Systems can also provide an automatic method of removing data called garbage collection; a popular use of garbage collection is in an OODBMS which uses persistence by reachability. An example of this can be shown through a course being cancelled at Napier. Because of this, all students on the course are either relocated to other courses or leave (in this scenario we will ignore the students that leave), and the course attribute of the student objects is changed to their new courses.
Once all objects representing the members of the course are updated, there will be no references left from the student objects to the old course object. It can therefore be removed from the object holding the set of courses, and as there will be no remaining references to the object it becomes unusable data. The diagram below shows an object graph with garbage and non-garbage aspects:

Figure 2

When garbage collection occurs there are two stages which must be performed. The first is identification of the garbage data (mark): this starts at the persistent root objects, marks everything referred to by the roots as required data, and then recursively marks further objects referred to by already-marked objects. Eventually no more objects will be marked, and anything left unmarked is garbage. At this point the second stage, the removal of the unwanted objects (sweep), is started, so that the space is freed and returned to the free pool for future reallocation. In garbage collection these stages are usually kept separate, as this is simpler; the method is usually called mark-and-sweep.

The problem with the mark-and-sweep method of garbage collection is that it has the unfortunate tendency to fragment the memory (heap). This occurs when a long-running program has undergone garbage collection several times. The dilemma arises when objects become spread out in the heap, separated by small vacant memory regions; this gives rise to the problem that memory allocation for an object may not be possible, because although there is ample vacant memory it is not in a contiguous block.

Another method of garbage collection, called stop-and-copy [3] (Microsoft used this algorithm for Java garbage collection), both collects garbage and de-fragments the heap. This method involves the memory being divided into two regions, an active region and an inactive region (which switch with every cycle of the method).
At any moment in time all dynamically allocated live object instances reside in the active region; the inactive region is empty. The stop aspect of the method begins when the memory in the active region is exhausted. The copy aspect then copies all of the live objects from the active region to the inactive region; as each object is copied, its references are updated to point to the new locations. The final step is to switch the active and inactive regions. As this method only copies the live objects, any objects left over are garbage and are deleted all at once when the active region becomes inactive. This results in objects being stored in contiguous memory locations, so the heap is automatically de-fragmented. The problem with this method is the cost of doubling the size of the heap (i.e. active plus inactive). An example of this method is shown in figure 3.

Figure 3

Explanation of figure 3:
Stage 1: The lower half of memory is the inactive region; the upper, active region is partly used.
Stage 2: The active region is gradually being filled with objects.
Stage 3: The active region is full; at this point the collector stops.
Stage 4: The live objects in the active region (upper) are copied to the inactive region (lower).
Stage 5: This is the state the memory is in when the process is complete.
Stage 6: The process restarts with the lower region as the active area; the upper region is now the inactive, unused area, and the lower region is gradually being filled with objects.
Stage 7: The active region is again full; at this point the collector stops.
Stage 8: The live objects in the active region (lower) are copied to the inactive region (upper).
Stage 9: This is the state the memory is in when the process is completed; this cycle can occur many times.
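The region-swapping cycle of figure 3 can be sketched as below. This is a simplification: the reachable set is passed in explicitly, where a real collector would trace it from the roots, and the updating of references to the copied objects is omitted.

```java
import java.util.*;

// Sketch of stop-and-copy: live objects are copied from the active
// region to the inactive one, the regions swap roles, and whatever
// was not copied disappears with the old region.
class CopyingHeap {
    List<Object> active = new ArrayList<>();
    List<Object> inactive = new ArrayList<>();

    void allocate(Object o) { active.add(o); }

    // 'live' stands in for the set a real collector would trace
    // from the persistent roots.
    void collect(Set<Object> live) {
        for (Object o : active)
            if (live.contains(o)) inactive.add(o);   // copy survivors compactly
        active.clear();                              // old region dropped wholesale
        List<Object> tmp = active;                   // swap the two regions
        active = inactive;
        inactive = tmp;
    }
}
```

Because the survivors are appended one after another into the new region, compaction falls out of the copy for free, at the price of keeping half the heap idle.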
A further method of garbage collection is reference counting. This method involves every object keeping a count of all references made to it: every time a new reference is made the total is incremented, and every time a reference is removed the total is decremented. If the total is zero the object is unreachable and so is garbage; a garbage collector can find objects with a reference count of zero and reclaim the space they use. The two main problems of this method are that the counts, and the work of maintaining them, take up space, and that there is no guarantee that all unreachable objects will have a reference count of zero, since objects in a cycle keep each other's counts above zero. Thus this method is not sufficient on its own and is often used with the mark-and-sweep method. An example of this method is shown in figure 4.

Figure 4

Explanation of figure 4: If the reference count of an object [i] has been decremented by 1, the system checks whether [i] has any references left. If [i] has no more references associated with it, the system performs regular reference counting. If object [i] still has some associated references, the system checks whether the remaining references are cyclic and deletes objects that receive only cyclic references. This involves a container which stores objects and their internal reference count 'I', counting only references from objects within the sequence. The process starts from object [i] and finds all objects reachable from it which are not root objects; sub-objects are added to the container with a reference count of 1, as each new object has a reference from object [i]. At this point the internal reference count for each object in the container is compared with its reference count in the database; if an object has a reference from at least one object outwith the cycle, it is marked as a live object, as are all objects reachable from it.
If our start object [i] is not marked as a live object, all references to it are de-referenced; in this case, as all references to object [i] are cyclic, they are removed when they are de-referenced. This combination method is used in MOOD: Material's Object-Oriented Database system.

An additional method, called mark-and-compact, again consists of two phases. The first phase, mark, starts at the persistent root objects, marks everything referred to by the roots as required data, and then recursively marks further objects referred to by already-marked objects. The second phase, the compaction phase, collects anything that remains as garbage and compacts the memory by moving all the live objects into contiguous memory locations. This method eliminates fragmentation and does not incur the cost of additional space.

References

[1] ODMG 2.0: A Standard for Object Storage, Doug Barry, Component Strategies, July 1998.
[2] http://www.odmg.org/library/readingroom/Article - Components Strategies July98.html
[3] Garbage Collection In Java, Vern Martin, December 2, 1997.
[4] A presentation on mark-and-compact: http://mood.mech.tohoku.ac.jp/impl/RefCount.html
[Ref] Working with Persistent Objects - J. Eliot B. Moss
[Ref] Atkinson 83 - An approach to persistent programming
http://www.javacaps.com/java_serial.html
http://java.sun.com/j2se/1.4/docs/api/java/io/Serializable.html
[Ref] Databases: From Relational to Object-Oriented Systems - Claude Delobel
[Ref] Object Databases: an ODMG approach - Richard Cooper

Figure references (“Persistence by Reachability” section onward):
Figure 1: OODB book
Figure 2: OODB book
Figure 3: Chapter 9 of Inside the Java 2 Virtual Machine, Garbage Collection, Bill Venners, ISBN 0-07-135093-4, McGraw-Hill.
Figure 4: MOOD: Material's Object-Oriented Database