THE DATABLITZ™ MAIN-MEMORY STORAGE MANAGER: ARCHITECTURE, PERFORMANCE, AND EXPERIENCE*

Jerry D. Baulier, Philip Bohannon, Amit Khivesara, Henry F. Korth†, Rajeev Rastogi, Avi Silberschatz, and S. Sudarshan‡
Bell Laboratories, Lucent Technologies, Inc.
700 Mountain Ave., Murray Hill, NJ 07974
{jdb,bohannon,khivi,hfk,rastogi,avi,sudarsha}@lucent.com

Brian Sayrs**
Lockheed Martin Advanced Technology Laboratories
1 Federal Street, Camden, NJ 08102

*Prepared through collaborative participation in the Advanced Telecommunications/Information Distribution Research Program (ATIRP) Consortium sponsored by the U.S. Army Research Laboratory under the Federated Laboratory Program, Cooperative Agreement DAAL01-96-2-0002. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. Copyright ©1998 Lucent Technologies, Inc.
†Contact author.
‡Current address: Department of Computer Science, IIT Bombay.
**Lockheed Martin was a Beta test site for the DataBlitz tool kit.

ABSTRACT

General-purpose commercial database systems, though widely used, fail to meet the performance requirements of applications that demand short, predictable response times and extremely high throughput rates. As a result, most high-performance applications are custom designed and lack the flexibility needed to adapt to unforeseen, evolving requirements. In the military domain, command centers often contain numerous "stovepipe" systems unable to share data easily. The need for improved data management is apparent with the rapid growth of communication networks and the increasing demand by end users for network-centric solutions that require both flexibility and high performance. These applications share the need for real-time response to a dynamically changing external environment; the need to store a substantial amount of data; and the need to process transactions with the usual ACID guarantees of traditional database systems.

The above considerations indicate a database system design in which the data resides in main memory and disks are used to store checkpoints and logs. While a commercial database system can be adapted to this environment by making a large buffer available in main memory, significant gains in performance and response time can be achieved by designing a database system tuned to this environment. In this paper, our focus is on the storage manager component of a database system. This component provides basic database functions such as disk and memory management, transactions (concurrency and recovery), and data access paths.

1. INTRODUCTION

Military applications that require real-time response to a dynamically changing external environment cannot usually tolerate the unpredictable delays associated with commercial database systems. While a high degree of disk parallelism can be used to achieve high rates of throughput, disk-based systems cannot achieve predictable response times in the tens of milliseconds. Main memory is the only technology capable of these characteristics. In the past, the high cost and size limitations of main memory led to the development of specialized memory and data management systems, which can be very costly to maintain and difficult to extend. Furthermore, many military applications operate in embedded environments (e.g., satellites, fighter jets, ships) that run real-time operating systems and favor lightweight, configurable database systems with small footprints.
2. BACKGROUND

While there has been much work on storage managers, including Exodus [CDRS89] and Starburst [HCL+90], there had been little work on main-memory storage managers until recently.¹ Within the past several months, substantial interest in main-memory storage managers has emerged, as evidenced by such commercial systems as TimesTen (based on the Hewlett-Packard Smallbase project), Angara, and the DataBlitz™ system. The current practice in main-memory systems is to code special-purpose, custom systems. This practice arises from the lack (until recently) of viable commercial alternatives, and from practical cost constraints mandating a minimalist approach to the use of hardware resources.

¹System M [SGM90] is a transaction processing test-bed for memory-resident data, but is not a full-feature storage manager.

We have seen increasing interest in employing general-purpose storage managers in place of existing legacy systems. There are also numerous efforts in the military to create sharable data models, such as the Army's Common Database (ACDB) program, which encourage the use of general-purpose storage managers. However, the lack of predictable performance in existing database systems has led to the development of special-purpose filter and replication mechanisms.

DataBlitz is a storage manager for persistent data whose architecture has been optimized for environments where the database is main-memory resident. A number of principles have evolved with DataBlitz. The first principle is direct access to data. DataBlitz uses a memory-mapped architecture, in which the database is mapped into the virtual address space of the process, allowing the user to acquire pointers directly to information stored in the database. A related principle is no interprocess communication for basic system services: all concurrency-control and logging services are provided via shared memory rather than through communication with a server. These principles contribute significantly to performance by eliminating from the execution path of data accesses (1) expensive remote procedure calls and (2) costly buffer-manager processing overhead.

The next guiding principle of DataBlitz is enabling the creation of fault-tolerant applications. DataBlitz provides a multi-level transaction model that facilitates the construction of high-concurrency indexing and storage structures. The system also supports recovery from process failure in addition to system failure, and uses codewords and memory protection to ensure the integrity of data stored in shared memory.

Another key requirement for applications that expect to store all their data in main memory is consistency of response time. As pointed out in [GL92], once the I/O bottlenecks of paging data in and out of the database are removed, other factors such as latching and locking dominate the cost of a database access. Since operating-system semaphores are too slow due to the overhead of making system calls, DataBlitz implements its own latches in user space for speed. To ensure predictable response times for user applications and scalability in symmetric multiprocessor environments, DataBlitz provides support for fine-grained concurrency control at both the lock level (e.g., record-level locking) and the latch level (e.g., for protecting system structures such as the lock table), as well as fuzzy checkpoints that minimally interfere with transaction processing.

Other principles that have guided DataBlitz's implementation are a toolkit approach and support for multiple interface levels. The former implies, for example, that logging facilities can be turned off for data that need not be persistent, and that locking can be turned off if data is private to a process. The latter means that low-level index components are exposed to the user so that critical application components can be optimized with special implementations. Although application developers may prefer the high-level relational interface for its ease of use, our experiments indicate that the low-level interfaces can provide substantial performance gains. Exporting the low-level interfaces also enables DataBlitz to be customized to meet specific application needs, such as a small executable size.

DataBlitz has been in use at several beta sites, both within and outside Lucent Technologies. It is the storage-management component of the High Performance Data Access (HPDA) testbed for military applications at Lockheed Martin Advanced Technology Laboratories. The Advanced Telecommunications/Information Distribution Research Program is using the HPDA framework to experiment with quality-of-service mechanisms that use network performance feedback to improve bandwidth allocation. Initial experiments with DataBlitz indicate that it is highly reliable, is easily integrated with other components, and provides developers with a range of design and implementation choices not available in other commercial database products.
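To make the direct-access principle concrete, the sketch below maps a database file into a process's address space with POSIX mmap and reads a record through an ordinary pointer, with no server round trip and no buffer-manager lookup on the access path. The file name, record layout, and record count are illustrative assumptions; this is a sketch of the general technique, not the DataBlitz API.

```cpp
// Minimal sketch of direct access to memory-mapped data on a POSIX
// system. The file name, record layout, and record count are
// illustrative assumptions; this is not the DataBlitz API.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

struct CustomerRecord {      // hypothetical fixed-length record
    char phone[12];
    char zip[8];
    // ... remaining attributes omitted
};

int main() {
    int fd = open("customers.db", O_RDWR);   // database file on disk
    if (fd < 0) return 1;
    std::size_t len = 250000 * sizeof(CustomerRecord);

    // Map the file into this process's address space. Afterwards,
    // records are reached by ordinary pointer arithmetic.
    void* base = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) return 1;

    auto* records = static_cast<CustomerRecord*>(base);
    std::printf("zip of record 0: %.8s\n", records[0].zip);

    munmap(base, len);
    close(fd);
    return 0;
}
```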
3. ARCHITECTURE

In the DataBlitz architecture, the database consists of one or more database files, along with a special system database file. User data is stored in the database files, while all data related to database support, such as log and lock data, is stored in the system database file. This allows storage-allocation routines to be used uniformly for (persistent) user data as well as for (non-persistent) system data such as locks and logs. The system database file also persistently stores information about the database files in the system.

As shown in Figure 1, database files opened by a process are directly mapped into the address space of that process. In DataBlitz, either memory-mapped files or shared-memory segments can be used to provide this mapping. This feature precludes using virtual-memory addresses as physical pointers to data in database files, but allows database files to be resized easily.

[Figure 1. Architecture of the DataBlitz System: each user process (user code linked with the DataBlitz library) maps the shared system database, containing locks and logs, together with database files 1 through N, into its virtual memory; checkpoints and logs are kept on disk.]
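Because a file may be mapped at a different virtual address in each process, a pointer stored inside a database file cannot be a raw virtual address. A common technique for memory-mapped storage, sketched below with invented names, is to store such a pointer as a (file, offset) pair and translate it through the per-process mapping base; DataBlitz's actual pointer representation may differ.

```cpp
// Sketch of offset-based database pointers, a common technique when
// each process may map a file at a different base address. All names
// here are hypothetical, not DataBlitz's representation.
#include <cstddef>
#include <cstdint>

struct DbPtr {
    std::uint32_t fileId;   // which database file
    std::uint64_t offset;   // byte offset within that file
};

// Per-process table of mapping base addresses, filled in as files
// are opened and mapped.
constexpr int kMaxFiles = 64;
static void* g_fileBase[kMaxFiles];

inline void registerMapping(std::uint32_t fileId, void* base) {
    g_fileBase[fileId] = base;
}

// Translate a stored pointer into a virtual address valid in this
// process; the offset is stable across processes, only the base varies.
inline void* toAddress(DbPtr p) {
    return static_cast<char*>(g_fileBase[p.fileId]) + p.offset;
}

// Convert an address within a known file back into a stored pointer.
inline DbPtr toDbPtr(std::uint32_t fileId, const void* addr) {
    auto off = static_cast<const char*>(addr)
             - static_cast<const char*>(g_fileBase[fileId]);
    return DbPtr{fileId, static_cast<std::uint64_t>(off)};
}
```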
3.1 LAYERS OF ABSTRACTION

DataBlitz's architecture is organized in multiple layers of abstraction to support the toolkit approach; Figure 2 illustrates this organization. At the highest level, users interact with the DataBlitz relational manager, a C++ class-library interface for accessing data stored in relational tables. It provides support for table scans (via iterators) and simple project-select joins. Below that is the heap-file/indexing layer, which provides support for fixed-length and variable-length collections, as well as template-based indexing abstractions. Services for logging, locking, latching, multi-level recovery, and storage allocation are exposed at the lowest level. New indexing methods can be built on this layer, as can special-purpose data structures for either an application or a database management system.

[Figure 2. Layers of Abstraction in DataBlitz: applications sit above the relational manager, which sits above the heap-file/indexing layer, which in turn rests on the logging, locking, latching, and allocation services.]

4. TRANSACTION MANAGEMENT IN DATABLITZ

Transaction management in DataBlitz is based on the principles of multi-level recovery [MHL+92, Lom92]. To our knowledge, DataBlitz is the only implementation of multi-level recovery for main memory, and one of the few implementations of explicit multi-level recovery reported to date.

5. EXPERIMENTAL DESIGN

We used a real-world database containing information about telephone customers. There are approximately 250,000 records, each of size 200 bytes, so the database is approximately 50 MB. Each record has 17 attributes, of which three are numeric and the remainder are character strings (e.g., name, address). Our transactions consist of a set of read and/or update operations. A read operation looks up a customer record based on a given telephone number. An update operation consists of the same lookup used for a read, followed by a change to the customer's zip code. The experiments were performed on a 200 MHz Sun UltraSPARC with 1 GB of RAM, running Solaris 2.5.

DataBlitz has a number of user-tunable parameters. For our experiments, two additional servers that typical real-world applications would require were running in the background: a checkpoint server, which takes a fuzzy checkpoint every 10 seconds, and a cleanup server, which checks for application process crashes. We found that these servers interfere minimally with user transactions. The experiments were conducted using both the C++ template-based hash index and the relational manager. In addition, we compared DataBlitz's performance with that of a commercial disk-based DBMS. The buffer-cache size parameter for the commercial server was set large enough that the entire database fit in the server's buffer, and prior to the experiments we loaded the entire database into the server's buffer cache.
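The sketch below shows the shape of the benchmark's two operations against a template-based hash index. The index class, record layout, and function names are simplified inventions of ours, and the locking and logging that the real system performs are omitted; this is not the DataBlitz interface.

```cpp
// Shape of the benchmark workload: lookups by phone number and
// zip-code updates through a template-based hash index. The HashIndex
// class and all names are hypothetical stand-ins.
#include <string>
#include <unordered_map>

struct CustomerRecord {
    std::string phone;   // lookup key
    std::string zip;
    // ... 15 further attributes in the real schema
};

// Stand-in for a template-based in-memory hash index.
template <typename Key, typename Rec>
class HashIndex {
public:
    Rec* lookup(const Key& k) {
        auto it = map_.find(k);
        return it == map_.end() ? nullptr : &it->second;
    }
    void insert(const Key& k, Rec r) { map_.emplace(k, std::move(r)); }
private:
    std::unordered_map<Key, Rec> map_;
};

// A read operation: look up a customer record by telephone number.
const CustomerRecord* readOp(HashIndex<std::string, CustomerRecord>& idx,
                             const std::string& phone) {
    return idx.lookup(phone);
}

// An update operation: the same lookup, followed by a zip-code change.
// In the real system the change would be logged and protected by a
// record-level lock; both are omitted here.
void updateOp(HashIndex<std::string, CustomerRecord>& idx,
              const std::string& phone, const std::string& newZip) {
    if (CustomerRecord* rec = idx.lookup(phone)) rec->zip = newZip;
}
```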
5.1 THROUGHPUT AS A FUNCTION OF OPERATIONS PER TRANSACTION

Read-only transactions. We begin by considering an offered load consisting solely of read-only transactions, with the number of operations per transaction varied from 10 to 500. Since transaction processing consists of the actual execution of the read operations plus the commit overhead for each transaction, we expect higher throughput for larger transaction sizes. The data in Figure 3(a) supports this: for sufficiently large transactions, the commit overhead is an insignificant fraction of overall execution time, and the throughput curve levels off.

[Figure 3. Throughput (operations/sec) as a function of operations per transaction for the hash index, the relational interface, and the commercial DBMS: (a) read-only transactions; (b) update-only transactions.]

The figure also shows that the low-level hash-index interface offers roughly a factor of 3.5 gain in throughput over the relational interface. This is because the relational interface involves additional processing for checking attribute types, enforcing null semantics for attributes, and copying key values. In a main-memory setting with no disk I/O, these overheads constitute a fairly large percentage of the work involved in performing a lookup. As a result, the low-level interfaces can make a significant difference, and it is important to make them available to those applications with the greatest throughput needs.

The lookup performance of DataBlitz at the relational interface is more than 10 times that of the commercial DBMS; at the lower level, DataBlitz achieves almost 35 times the throughput. These performance gains over the commercial DBMS can be attributed primarily to three factors: (1) lookups in DataBlitz do not interact with a buffer manager; (2) DataBlitz applications access data directly, without any interprocess communication, whereas applications using the commercial DBMS must communicate with a server; and (3) DataBlitz applications incur no SQL-related overhead.

Update-only transactions. Figure 3(b) shows the same data for update-only transactions. Although the slope of the curve decreases as transaction size increases, the effect is much less pronounced than in the read-only case. Here, the amount of work to be done at commit time is larger, since log records for the updates must be written to disk, and this disk I/O is the dominant cost. Larger transactions allow more log records to be transferred in a single disk write; since transfers to disk at larger granularity can be performed more efficiently, large transaction sizes benefit the update-only case more than the read-only case described earlier.

Once again, the low-level hash-index interface offers greater throughput, owing to its lower processing overhead and smaller volume of log records. Further, because disk I/O dominates the cost at the lower level, the benefits of the low-level interface manifest themselves to a greater degree as the disk overhead per operation decreases, that is, as the transaction size increases. The difference is roughly a factor of 3 for transactions consisting of 500 update operations.

Contrary to what one might expect, DataBlitz also outperforms the commercial DBMS for update-only transactions. This is surprising: for update transactions, both DataBlitz and the commercial DBMS must flush the log to disk at transaction commit time, so one would expect disk-I/O overhead to dominate transaction-processing costs and to be identical for both systems. However, DataBlitz's throughput at the relational interface is about 4 and 9 times that of the commercial DBMS at 100 and 500 operations per transaction, respectively, and with DataBlitz's hash index the multiples are even higher. It is worth noting that while DataBlitz's performance improves as the number of operations per transaction increases, the commercial DBMS's performance stays essentially the same. Our conjecture is that each update operation involves a remote procedure call to the server, and that the server flushes the log records for each of the transaction's operations independently, rather than batching them together in a single flush at transaction commit.
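The conjectured difference, one log flush per transaction rather than one per operation, can be illustrated with the following sketch; the log-buffer class and its layout are assumptions for illustration, not DataBlitz's log manager.

```cpp
// Sketch of commit-time log batching, the behavior conjectured above.
// Names and layout are illustrative only.
#include <cstdio>
#include <string>
#include <vector>

class LogBuffer {
public:
    explicit LogBuffer(std::FILE* logFile) : log_(logFile) {}

    // Each update appends a log record in memory; no disk I/O yet.
    void append(const std::string& rec) { pending_.push_back(rec); }

    // At commit, all of the transaction's records go to disk together,
    // so the per-commit disk cost is amortized over every operation in
    // the transaction. A server that instead flushes after each
    // operation pays this cost once per operation, which is consistent
    // with the flat throughput curve observed for the commercial DBMS.
    void commit() {
        for (const auto& rec : pending_)
            std::fwrite(rec.data(), 1, rec.size(), log_);
        std::fflush(log_);   // one flush per transaction; a real log
                             // would also fsync for durability
        pending_.clear();
    }

private:
    std::FILE* log_;
    std::vector<std::string> pending_;
};
```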
REFERENCES

[CDRS89] M. J. Carey, D. J. DeWitt, J. E. Richardson, and E. J. Shekita. Storage management for objects in EXODUS. In W. Kim and F. H. Lochovsky, editors, Object-Oriented Concepts and Databases. Addison-Wesley, 1989.

[GL92] V. Gottemukkala and T. Lehman. Locking and latching in a memory-resident database system. In Proc. of the Int'l Conf. on Very Large Databases, pages 533-544, August 1992.

[HCL+90] L. M. Haas, W. Chang, G. M. Lohman, J. McPherson, P. F. Wilms, G. Lapis, B. Lindsay, H. Pirahesh, M. Carey, and E. Shekita. Starburst mid-flight: As the dust clears. IEEE Transactions on Knowledge and Data Engineering, 2(1), March 1990.

[Lom92] D. Lomet. MLR: A recovery method for multi-level systems. In Proc. of the ACM-SIGMOD Int'l Conference on Management of Data, pages 185-194, 1992.

[MHL+92] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems, 17(1):94-162, March 1992.

[SGM90] K. Salem and H. Garcia-Molina. System M: A transaction processing testbed for memory resident data. IEEE Transactions on Knowledge and Data Engineering, 2(1):161-172, March 1990.

ACKNOWLEDGMENTS

DataBlitz represents a substantial effort over several years, to which many have contributed. We would like to thank H. V. Jagadish for significant early contributions to DataBlitz. Dan Lieuwen also provided significant early contributions and has made several valuable suggestions concerning process failure and the organization of the heap file. S. Seshadri contributed to the T-tree concurrency algorithm and the relational interface. Steve Coomer suggested the strategy of using codewords to protect checkpoint images on disk. Dennis Leinbaugh suggested the structure of the coalescing allocator. We would also like to thank the following talented individuals who have contributed to the design and implementation of specific systems in Dali and/or DataBlitz: Yuri Breitbart, Soumya Chakraborty, Ajay Deshpande, Sadanand Cogate, Chandra Gupta, Sandeep Joshi, Peter McIlroy, Sekhara Muddana, Mike Nemeth, John Miller, James Parker, P. P. S. Narayan, and Yogesh Wagle.

*The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government.