Pure Java Databases for Deployed Applications

Nat Wyatt
Informix Corporation
[email protected]

Abstract

Most database systems are designed to be centralized – the database is installed once in a central location and the various applications connect to it. But there are many situations in which a database must be deployed – installed in remote locations, probably without the benefit of a trained database administrator. A deployable database system must be easy to install, require zero management while it’s running, and be capable of running on diverse platforms. It must also satisfy the traditional database requirements of standards, performance, and reliability.

These requirements can be met when the application and database are implemented in Java. Applications that use this architecture can be deployed and yet still retain many of the benefits that centralized database systems were designed to provide.

This paper describes the situations in which deploying database applications is desirable, outlines the requirements for a database to be deployable, and describes the aspects of the Cloudscape pure-Java DBMS architecture which make it deployable. It concludes with an example of an application which uses a deployable Java DBMS.

1. Why deploy database applications

Despite the current popularity of centralized database architecture, there are still situations in which it is necessary or desirable to deploy a database application. The reasons can be divided into three broad categories: performance and scalability, local autonomy, and situations in which the application is intrinsically distributed.

For the purpose of this paper, the word deployment means the installation of software on a remote machine. Deployed database applications install a DBMS engine as part of the software, and keep data on the remote machine. A deployed database application may or may not be part of a distributed system.
The simplest systems are standalone programs, and the most complex synchronize data over a network. In between, there are applications where the data is independent, but the systems communicate with each other with an application protocol.

1.1. Performance, availability, and scalability

In any application involving a network and more than one computer, performance of the application depends on the relationship of three factors: the speed and capacity of the server computer, the speed and capacity of the network with respect to the amount of communication that is necessary, and the speed and capacity of the client computer.

When the network is relatively slow and the client computer is relatively fast, the application benefits from placing more processing at the client. Applications which use slow networks such as dial-up modems or radio fall into this category. Even when the network is relatively fast, remote processing is desirable if the amount of network traffic would otherwise be excessively high.

Similarly, if the network is unreliable or occasionally unavailable, remote users will have a better experience if more processing is performed locally. The system is more available when requests can be answered from a local store even if the network is unavailable. Most mobile and “occasionally connected” applications fall into this category.

By moving more processing to the client, the server may be able to support more clients. The application as a whole can scale better if some processing can be moved from the single server machine to the numerous client machines.

1.2. Local autonomy

Another reason for deploying data is to allow clients to have some control over the data they are using. There are two areas in which clients may want autonomy: control over the data itself, and control over resource allocation for access to the data.
For some applications, the client thinks of the data as “theirs.” They may want to add to or modify “their” data in ways that are specific to the client. These additions and modifications may not be compatible with the needs of other clients, nor be acceptable to the administrators of the central system.

For example, in a loosely coupled business such as a franchise, agency, or distributor, the central core business may provide information to and collect information from its affiliates. The affiliates typically focus on a specific segment of the business but are all nominally in competition with one another. Since each affiliate is focused on a specific market, it may want to enhance the information it gets with its own market-specific additions, but it does not want to provide those additions to the other affiliates.

Another area where clients may want autonomy has to do with control of the performance of the deployed application. When a client is sharing a server machine with a possibly unknown (and probably growing) number of other clients, the response time of the central system cannot be guaranteed. Rather than rely on the administrators of the central system to increase capacity accordingly, clients can guarantee response time and availability by providing the computing resources themselves.

An example of resource autonomy might be in a supply chain application. Suppose Company A provides product information and Company B uses it to make purchasing decisions. If the information is provided over a network, then Company B has to trust that Company A will scale up its site as its business expands. But if A provides B with a copy of the data (by deploying a database application), Company B can guarantee response to queries because it administers the machine on which that application runs.

1.3. Intrinsic distribution

There are also intrinsically distributed applications – applications that are impossible to centralize no matter how fast the network is.
An example of this is in network management. An IP router that maintains statistics on packet traffic cannot send information about each packet to a central system: the management information would generate more network traffic than the traffic being monitored! A deployed database application running on the router would enable it to collect information locally and aggregate it into summaries.

2. Deployed application requirements

Despite the cases in which deploying an application would be desirable or necessary, many applications that could be deployed are not. This is because of the difficulties in distributing, installing, managing, and evolving the remote application.

Centralized applications do not have these difficulties. A server-centric application runs on computers that are physically located at the organization supporting the application. Proximity makes it easy for the administrators to manage it, since they can physically operate the computers, if necessary. Furthermore, if the application needs to be changed, the changes can be applied once at the central site. Clients immediately see the changes and there is no need to distribute the changes to all the client sites. Lastly, the ubiquity of web browsers means that clients can access the application from anywhere as long as they have a network connection.

Ideally, a deployed application would be just as easy to manage and deploy, and would provide the same ubiquitous client access.

2.1. Simplified administration

Applications that use databases are notoriously hard to manage. Much of the trouble is that database systems have not been designed to be easy to use. But there are also two intrinsic problems.

One problem is that the DBMS runs as a separate process or processes, which must be managed separately from the application server. Typically the way to manage the database server is different from the way that the application server is managed.
The administrator has numerous “moving parts” to understand and manage.

The other intrinsic problem is that database systems are very general, and the usual goal is to get every last bit of performance. It’s difficult to configure the system to match the application architecture, and it’s difficult to tune the system for maximum performance.

2.2. Easy to modify the application

The current trend towards web-based application deployment is in part a reaction to the difficulty of managing client software in client/server deployment. Managing client software is daunting when the application is simply a database client, and is even more difficult if the deployed application contains a database.

The problems that are general to client/server apply: difficulty of installing new versions of the application, and issues having to do with the version of the client not matching the version of the server. But deployed databases have a particular problem as well. Each client retains data over application upgrades, but the database schema or even the version of the DBMS may have changed as part of the new version. This means that the schema changes must be applied remotely, and perhaps even the DBMS itself must be upgraded remotely.

2.3. Ubiquitous client access

Perhaps the biggest benefit of web-based deployment is that the web browser that runs the application on the client can be expected to be there, and most applications can be developed to take into account the minor differences between browsers. Generally speaking, it does not matter what hardware or operating system is running the browser. This makes the application immediately available to almost every computer.

This is in contrast to the situation of deploying most software, where the application developer has to provide a complete software configuration for each hardware platform and operating system configuration. The configurations may have to be different even across operating system versions.
When the application is assembled from components (such as an application server and a database), the configuration problem applies to all of the components. So the problem for the application developer becomes combinatorial: assemble the application from mutually compatible software components that run on the combination of software platforms.

3. Deployable database requirements

A database that is to be included with a deployed application has a set of requirements that derive from being a database and needing to be deployed.

Since it’s a database, it has to be
• Fast
• Reliable
• Standards compliant

Since it is to be deployable, it must be
• Portable
• Manageable
• Easy to integrate into the application

The purpose of portability is to provide ubiquitous client access. Portability applies not only to the DBMS engine itself, but also to the initial data which may be shipped with the application.

Since it is desirable that the receiver of the deployed application not have to be a database expert, the database must be easy to manage. Ease of management covers all aspects of database operation, including installation as well as day-to-day maintenance.

A database which is deployed as part of an application should be well-integrated with it. This means that the database should not impose its own installation mechanism, user interface, management interface, or other artifact on the application. Ideally, the database should be invisible, with all database-related operations (if any) appearing to be part of the application.

Supporting the application lifecycle means that it must be possible to make changes to deployed applications as well as install them. If the application includes a database then it must be possible to upgrade the database remotely, in place. Upgrading the database may involve moving to a newer database version, updating the schema, and possibly synchronizing the data.
A DBMS must meet these requirements in order to enable the application it is a part of to be deployable.

4. Java and deployment

By implementing a database in Java, two of the requirements of deployability can be satisfied.

Java is well known to be a way of developing portable programs. Java programs are stored in a format which is the same on all platforms, and they are run by a virtual machine which is defined to work the same on all platforms. These characteristics enable Java programs to be distributed to and run on any machine with a Java virtual machine. Since Java virtual machines are available for almost all platforms, Java programs are quite portable.

Java also addresses the requirement for embeddability. Embeddability means the capability of embedding one program as a software component in another program. It’s necessary in a deployed database application in order to make the application easier to manage.

Embedding may be possible for relatively simple or well-integrated components using native code compilers, although the configuration issues still arise. The problem is that, unless the systems use the same mechanisms for managing system resources, they will conflict with each other. For example, on Unix systems programs use signal masks and handlers for exception handling. Unless all the components in the system use the same conventions (and different signals), they will not be able to work together.

Another problem with embedding and C-based software is that systems built in C or C++ are not “pointer safe.” Bugs in either system can cause corruption in the other.

These issues are among the reasons for the popularity of network-based servers, where integration takes place over a well-defined socket protocol. Each server runs in its own resource space, with memory protected by memory mapping hardware.
Database systems are extremely wary of having their memory corrupted (since one of their roles is to protect against the corruption of data), and often have special resource management techniques to improve performance.

Java provides solutions to these problems. In addition to being pointer safe, it provides a standard way of managing memory, threads, and exceptions. It has built-in internationalization support, as well as many standard libraries for handling configuration. Its object model allows for pure interfaces as well as private classes, methods, and fields. Its class loading mechanism allows components to prevent clients from ever having references to internal objects.

There are some other benefits of a pure-Java DBMS architecture that are not necessarily associated with deployment. Briefly, these include:
• The developer can choose which Java Virtual Machine to use for the application. This can provide consistency across clients and servers, and allows the developer to use the JVM which works best for the application.
• Java Virtual Machine performance is still improving rapidly, much more rapidly than the performance of compiled C code is improving. As the JVM performance improves, the performance of the DBMS improves as well.
• Since Java provides many low-level services such as memory management, threads, and exceptions, the DBMS implementer doesn’t have to.
• Java is a fast and safe programming environment. This allows the DBMS developer to build the engine and make improvements to it faster and with fewer bugs than in a C-based system.
• Finally, because the engine is Java, it is extremely easy to extend the SQL language with Java extensions. There is no penalty for crossing the “boundary” between SQL and Java.

5. Java database architecture

The architecture of a Java DBMS derives from the requirements of databases in general and the requirements of deployment. It must be fast, reliable, and standards based.
It must be portable to any hardware/OS platform, be embeddable into the client application or server, and be easy to install and manage. Ideally it would be easy to extend the capabilities of the DBMS with Java.

Being implemented in pure Java satisfies the requirement for portability, and with a little care also the requirement for embeddability. However, performance, reliability, and simplicity of management must be provided by the design of the database.

The following sections illustrate how some of these requirements are met by describing some elements of the architecture of the Cloudscape ORDBMS, an example of a pure Java database system. An overview of the Cloudscape system is given in [1], and the complete documentation is online at [2].

5.1. Application architecture

Cloudscape is an object-relational database management engine implemented as an embeddable, pure Java class library. It supports the SQL92 database language standard, with extensions for storing and retrieving Java objects in tables and accessing Java objects in queries. Its stored procedure language is Java.

Applications access data managed by the Cloudscape engine via the JDBC (Java Database Connectivity) API [3]. This is the only API that clients use to access the database; there is no need for the application to depend on any Cloudscape-specific API. This means that applications that run against Cloudscape can run against other databases simply by switching the JDBC driver. Conversely, any JDBC-compliant application can run against Cloudscape.

Since the client and server are in the same JVM, the Cloudscape JDBC driver transfers data between client and server without the need for network communication. When the application requests a connection from the JDBC driver manager, the Cloudscape JDBC driver checks to see if the requested database has been started within that JVM. If not, the database is booted.
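The boot-on-first-connection behaviour described here can be sketched in a few lines. This is only an illustration under stated assumptions, not Cloudscape code: the `Engine` and `Connection` classes, the `connect` method, and the database name are all hypothetical, and a real application would go through the JDBC driver manager instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class EmbeddedBootSketch {
    /** Hypothetical stand-in for a database engine booted inside this JVM. */
    public static final class Engine {
        public final String name;
        Engine(String name) { this.name = name; }
    }

    /** Hypothetical connection: nothing more than a reference to the in-process engine. */
    public static final class Connection {
        public final Engine engine;
        Connection(Engine engine) { this.engine = engine; }
    }

    // One engine per database name within this JVM.
    private static final Map<String, Engine> BOOTED = new ConcurrentHashMap<>();

    // Counts how many times a boot actually happened (for illustration only).
    public static final AtomicInteger BOOT_COUNT = new AtomicInteger();

    /** Boots the named database if it is not yet running, then returns a connection to it. */
    public static Connection connect(String dbName) {
        Engine engine = BOOTED.computeIfAbsent(dbName, n -> {
            BOOT_COUNT.incrementAndGet();
            return new Engine(n);
        });
        return new Connection(engine);
    }

    public static void main(String[] args) {
        Connection a = connect("inventory");
        Connection b = connect("inventory");
        System.out.println(a.engine == b.engine); // true: both connections share one engine
        System.out.println(BOOT_COUNT.get());     // 1: the second request found it already booted
    }
}
```

The sketch uses `ConcurrentHashMap.computeIfAbsent` so that two threads asking for the same database at once still boot it only once.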
Either way the driver manager returns a connection object which is simply a reference to an object within the database.

[Figure: JVM, Application, JDBC API, Cloudscape, Disk]

When the application uses the connection object, parameters and results are simply passed across the method-call boundary. There is no marshalling, unmarshalling, or network overhead. Results can be returned as direct pointers to data within the database engine when the result objects are immutable values (as most of the basic Java objects are). The database (usually) stores data on disk.

Even though it is embedded, the database engine supports multiple simultaneous connections from multiple client threads.

5.2. Server frameworks

The architecture is client/server; a single database engine provides all access to the database file. In contrast to most client/server database systems, for which the main API is via a transport protocol over a network socket, the Cloudscape database provides no intrinsic networking support. In order for it to communicate with other processes, it must run “inside” a server framework. A server framework is simply a kind of application which calls Cloudscape over its embedded JDBC driver and communicates with other processes via some network protocol.

Cloudscape provides a JDBC server framework and corresponding client library as a separate product. This server framework defines a network protocol for transmitting JDBC requests and results. It listens on a network port, “forwards” JDBC requests to the embedded JDBC driver, and returns the results. The client piece operates as a standard JDBC driver. In this configuration, the Cloudscape database engine looks much like any other client/server database system.

However, in most Cloudscape applications, the server framework is provided by the application developer, not by Cloudscape. The server framework communicates with its clients using a network protocol that is most appropriate for the application. The most common framework/protocol pair is Servlets/HTTP, but others include CORBA/IIOP, and other application server frameworks and their protocols.

A typical Servlet/HTTP architecture looks like this:

[Figure: HTTP, Web Server, Servlets, JDBC API, Cloudscape]

In this architecture, the web server, the application logic (embodied in Servlets), and the data persistence capability are all combined into a single process. There is no general way for a client to issue SQL queries against the database. All access to the data is mediated by the application logic.

5.3. Pure Java implementation

Since the Cloudscape DBMS is designed to be portable to any Java environment, it naturally contains no non-Java code. The engine is packaged as a class library in a standard Java archive (jar file). It supports all versions of Java from JDK 1.1 on, and automatically adjusts so that the same jar file will work with JDK 1.1 or 1.2. Compatibility is confirmed by extensive testing on a wide variety of Java Virtual Machines from various vendors on multiple platforms.

The pure Java implementation and extensive testing for compatibility make the database run on almost any platform, make it possible to add Java database capability to almost any existing Java application or framework, and allow the application developer to select the JVM which best suits the needs of the application.

5.4. Resource management

Since the Cloudscape DBMS is designed to run in the same process as the application, it must share resources with the enclosing application.

5.4.1. Threads. The Cloudscape DBMS does not have threads of its own for processing database requests. Instead, it “borrows” the thread of the calling application. All Java execution is inside threads. When a thread makes a method call into the DBMS engine, the caller’s thread is used to perform the database activity. If the database needs to block for a lock or I/O, it’s the caller’s thread that ends up blocking.
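A minimal sketch of thread borrowing follows. It assumes a hypothetical `Engine` class with an `execute` method, neither of which is Cloudscape's API; the point is only that an embedded engine with no worker threads does its work on whichever thread calls into it.

```java
public class BorrowedThreadSketch {
    /** Hypothetical embedded engine that owns no threads of its own. */
    public static final class Engine {
        /** Does the "database work" on the calling thread and reports that thread's name. */
        public String execute(String sql) {
            return Thread.currentThread().getName();
        }
    }

    /** Calls the engine from a freshly named application thread and returns what the engine saw. */
    public static String executeOnNewThread(Engine engine, String threadName) {
        String[] seen = new String[1];
        Thread caller = new Thread(() -> seen[0] = engine.execute("SELECT 1"), threadName);
        caller.start();
        try {
            caller.join(); // join() makes the write to seen[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        Engine engine = new Engine();
        System.out.println(executeOnNewThread(engine, "app-worker")); // app-worker
    }
}
```

Because the engine method is an ordinary method call, any blocking it did (for a lock or for I/O) would suspend the application thread that made the call.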
This approach to thread management has a number of advantages. First, use of native Java threads means that threads performing database work happily coexist with other threads in the JVM, and doing database work does not add extra threads to the application. Second, the database can handle a large number of connections because the connection overhead is simply a few objects. Third, the application can use a connection from any thread; this makes connection pooling easy. Finally, use of the native threads enables the engine to take advantage of multiple processors on JVMs that support multiprocessing.

5.4.2. Memory. The Cloudscape DBMS allocates memory from the standard Java heap. Since the heap is being shared with the enclosing application, as much memory as possible is managed so that the JVM’s garbage collector can return it for reuse by the application. There are some places where large or frequently reused objects (such as data page buffers) are retained by the DBMS even when it is idle.

5.4.3. Network sockets. The Cloudscape DBMS does not use any network sockets, partly because they’re not needed in an embedded architecture, but also to avoid the configuration issue of having to decide on port numbers.

5.4.4. System functions. The DBMS is very careful not to use system functions in a way that would conflict with the enclosing application. For example, calling System.exit() would not give the enclosing application a chance to do a graceful shutdown.

5.5. Performance

A database must be fast, and special attention must be paid to performance in a Java architecture. JVM performance is improving rapidly, but there is still room for improvement. Java performance does not yet match that of C or C++. As with all programming environments, care must be taken to understand the performance implications inherent in the environment. The Cloudscape system uses the following techniques, among others, to get good performance.

5.5.1. Reuse of immutable objects.
Java provides automatic memory management, and the Cloudscape DBMS takes advantage of it where appropriate. One area where automatic garbage collection helps is in enabling the reuse of immutable objects. Immutable objects can be shared easily, and the garbage collector guarantees that they will be reclaimed when no longer used. The application program does not have to track them, which saves complexity and overhead, and clients do not have to obey any particular protocol to release them.

One area in particular where the Cloudscape system uses this technique is in returning column values to the client. Objects are converted from their on-disk stored format once, qualified by the query execution system, and returned directly to the client. This avoids marshalling/unmarshalling overhead, and saves memory because the result objects are shared between the client and the DBMS.

5.5.2. Careful memory management. Having automatic memory management does not mean that the system can be profligate with memory. Allocating memory takes time, and the more memory that is allocated the more often the garbage collector has to run to reclaim it.

When appropriate, the Cloudscape system caches and reuses objects to avoid creating new objects. Getting an object out of a cache is not necessarily faster (the overhead of cycling objects through a cache is typically higher than the overhead of creating new ones). However, avoiding allocations reduces the number of times the garbage collector has to run, and makes its job easier when it does run.

Maintaining an object in a cache effectively reserves that memory for the DBMS. This conflicts with the resource management principle of allowing memory to be recycled from the DBMS to the application. Therefore caches are used only in those cases where memory reservation is appropriate, such as caches of locks, disk page buffers, and the like.

5.5.3. Awareness of library method performance. When making calls to standard Java library methods, it is important to understand the performance implications. For example, operations on strings can have many hidden object allocations, and date processing can be excessively slow. Similarly, it’s important to understand whether the method uses synchronization. For example, most methods on Hashtable are synchronized, so Hashtables are not used on performance-critical code paths.

5.5.4. Traditional techniques. Since the execution speed of Java is generally slower than that of C or C++, relatively more performance gain is achieved by traditional performance optimization techniques. These include query optimization techniques and plain old optimization of hot code paths.

5.6. Component architecture

The component architecture of the Cloudscape DBMS engine is similar to many other database engines. It consists of a (very thin) JDBC API layer that interacts with a SQL language processing system, which uses a storage manager for accessing data. The storage system has no dependencies on the SQL system, and the SQL system has no dependencies on the network. All three major subsystems make use of a set of low-level basic services. All of the subsystems are implemented as modular components, which are supported by a component framework.

[Figure: Component Framework, JDBC API, SQL, Storage]

5.7. Component framework

In contrast to some architectures, the Cloudscape DBMS engine is not monolithic. Using Java interfaces, each subsystem, or service, has a well-defined interface that it implements. The component framework registers and finds services and coordinates the startup and shutdown of the system.

The component architecture enables the same software code base to be used to create multiple configurations of the DBMS. This capability is used to build specialized engines, add access methods, and provide special-purpose support for

5.8. Basic services

The basic services consist of a set of small, common services, which all the database components can use.
They include things like generic cache management, daemons, cryptography, license management, error message repository, and diagnostic output.

5.9. JDBC API

The JDBC API is the only entry point into the Cloudscape DBMS. When a caller calls a JDBC API method, the caller’s thread is associated with the connection being used for the operation, synchronization is obtained to prevent simultaneous use of the connection, and an error handler is set up to deal with any exceptions that might occur in the engine.

5.10. SQL language

The SQL language subsystem is relatively complex, consisting of three high-level modules that in turn contain multiple smaller sub-modules. The three top-level modules correspond to its three main responsibilities: maintaining schema information, compiling SQL queries, and executing queries. The SQL subsystem does cost-based optimization, and supports multiple join strategies (nested loop and hash).

While much of the SQL subsystem performs tasks which are common to many database systems, the following aspects derive from its mission to be part of an embeddable Java database.

5.10.1. SQL queries are compiled into Java classes. Executing a query is simply a matter of creating an instance of the class, substituting the parameters, and calling its run method. This makes query execution very fast.

5.10.2. Data statistics derived from indexes. To avoid the configuration problem of when to collect data distribution statistics, the query optimizer collects distribution information directly from database indexes.

5.10.3. Automatic recompilation. Query plans are automatically recompiled when tables grow or shrink significantly. This allows the optimizer to choose new query plans if necessary.

5.11. Storage

The storage subsystem provides data storage, access methods, concurrency control, transaction management, crash recovery, and backup and restore functions.

5.11.1. Portable database format. The format for data in the database is as defined by Java.
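As an illustration of what a Java-defined data format buys, the sketch below encodes a row with `java.io.DataOutputStream`, whose byte layout is fixed by the Java library specification and does not depend on the host machine. Whether Cloudscape uses these particular classes is an assumption made for the example; the `encode` and `decode` helpers and the two-column row are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PortableFormatSketch {
    /** Encodes a hypothetical two-column row (int id, String name) into bytes. */
    public static byte[] encode(int id, String name) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(id);   // always four bytes, high byte first
            out.writeUTF(name); // two-byte length followed by modified UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes a row produced by encode, on any platform. */
    public static String decode(byte[] row) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(row))) {
            return in.readInt() + ":" + in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] row = encode(258, "ab");
        System.out.println(row.length);            // 8
        System.out.println(row[2] + " " + row[3]); // 1 2 (258 is 0x00000102)
        System.out.println(decode(row));           // 258:ab
    }
}
```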
Since this format is the same on all platforms, databases themselves are portable. This means that a database can be created on one machine and sent to another, which is an important aspect of deployability.

5.11.2. Format versioning. In order to allow the database layout to evolve without requiring databases to be unloaded and reloaded, every format has a format id. If a format changes, it is assigned a new format id. Formats which are no longer current are converted into the new format as they are encountered.

5.11.3. Data in files. A Cloudscape database is a collection of operating system files, rather than being based on hard-to-manage raw partitions. This avoids a source of difficulty in administration.

5.11.4. I/O parallelism. There is a hard-to-avoid tradeoff between easy management and high performance in providing I/O parallelism. I/O parallelism is important for good performance, but inherently hard to manage. Rather than provide a complex mechanism for spreading data across disk spindles, with the associated problems of data partitioning, Cloudscape assumes that I/O parallelism will be provided by the filesystem (as by a RAID disk array). This won’t provide optimal performance, but it will provide better performance and is much easier to manage.

It is, however, possible to move the transaction log to a separate device from the data. The storage system does not write data pages at transaction commit, but it does have to flush the transaction log. Putting the log on a separate device is such an advantage that it is worth the extra management overhead.

5.11.5. Multiple filesystems. In order to support a diverse set of target platforms, the storage system accesses storage through an abstract filesystem API. This allows the database to work in environments where there are no disks, or the behavior of the filesystem is not disk-like. The filesystem implementation of a given database system is configured via the component module system.

5.11.6. Default checkpoint and backup model. By default, checkpoints, which sync data buffers to disk, automatically truncate the transaction log. This avoids the problem of having to manually truncate the log to avoid filling up the disk. However, it also means that backups must be made by making complete copies of the database. By default, a Cloudscape database assumes that the application will use operating system tools to make the copy, and provides an API to coordinate database operations with the copy.

6. Application example

An example of the usage of an embeddable Java database in a production application is the use of a Java DBMS within a distributed e-commerce network management framework. This management system makes it easy to monitor the performance of an e-commerce site not only based on information from the e-commerce site itself, but also based on information collected at the site’s suppliers and service providers.

The product in which the Java DBMS is embedded is a management server. The server monitors the various elements that make up an e-commerce operation, consolidates the information, and presents views of it to web-based clients. The elements being managed are higher-level network components such as web servers, application servers, and databases. Clients get consolidated management information that reports on the speed, availability, and responsiveness of the e-commerce site as a whole.

The management server is implemented in Java. A Java web server provides access to Servlets that customize the reports for the web clients. The Servlets query the database for the latest management information. Meanwhile, other threads in the management server collect management information and store it in the database.

[Figure: Browsers, HTTP, Management Server, Servlet, Monitoring Logic, JDBC, Network, Managed Elements]

Management servers can monitor other management servers.
By deploying management servers to supplier and service provider sites, the management system can provide information about the e-commerce operation as a whole.

[Figure: Mgmt. Srvr., Internet, Supplier, Provider, eCommerce Business, Java DBMS]

By using a Java based application architecture and a pure Java database, the users of this management system gain the following advantages:
• Easy installation and management. The database, web server, and management logic are all installed and managed as a single product, and all run within a single process.
• Fast access to customized management views. Since the management system uses a database to log and query management information, users can get fast access to customized views of the information.
• Ubiquitous clients. Since the management system is implemented in Java, it can be deployed to almost any platform. The management system can be deployed to suppliers, enabling it to monitor the distributed e-business as a whole.

A management architecture based on a platform-specific, separate-server database architecture would not be able to obtain these benefits.

7. Conclusion

There are many situations in which deploying a database would be useful, but DBMS architectures based on native code make such deployment difficult.

An application architecture where a pure-Java database is embedded in a Java application makes deployment easy: it reduces the number of “moving parts” in the application, and it makes the resulting application portable to almost any computing platform. Such a Java DBMS must provide the usual database virtues of standards, performance, and reliability, but must also be easy to install and manage. A DBMS which meets these requirements can provide the benefits of database systems in situations where databases would otherwise be unusable.

8. References

[1] N. Wyatt, “Cloudscape 3.0: A Technical Overview”, Cloudscape, Inc., http://www.cloudscape.com/Products/WhitePapers/whitepapers.htm
[2] Cloudscape Product Documentation, http://www.cloudscape.com/support/documentation.html
[3] Sun Microsystems, “JDBC Data Access API”, http://java.sun.com/products/jdbc/
[4] Sun Microsystems, “Java Servlet API”, http://java.sun.com/products/servlet/index.html
