THE DATABLITZ™ MAIN-MEMORY STORAGE MANAGER:
ARCHITECTURE, PERFORMANCE, AND EXPERIENCE*
Jerry D. Baulier, Philip Bohannon, Amit Khivesara, Henry F. Korth†,
Rajeev Rastogi, Avi Silberschatz, and S. Sudarshan‡
Bell Laboratories
Lucent Technologies, Inc.
700 Mountain Ave., Murray Hill, NJ 07974
{jdb,bohannon,khivi,hfk,rastogi,avi,sudarsha}@lucent.com
Brian Sayrs**
Lockheed Martin Advanced Technology Laboratories
1 Federal Street, Camden, NJ 08102
ABSTRACT
General-purpose commercial database systems, though
widely used, fail to meet the performance requirements of
applications requiring short, predictable response times,
and extremely high throughput rates. As a result, most
high performance applications are custom designed and
lack the flexibility needed to adapt to unforeseen, evolving
requirements. In the military domain, command centers
often contain numerous “stovepipe” systems unable to
share data easily. The need for improved data
management is apparent with the rapid growth of
communication networks and the increasing demand by
end users for network-centric solutions that require
flexibility and high performance. These applications share
the need for real-time response to a dynamically changing
external environment; the need to store a substantial
amount of data; and the need to process transactions that
have the usual ACID guarantees of traditional database
systems.
The above considerations indicate a database system
design in which the data resides in main memory and disks
are used to store checkpoints and logs. While a
commercial database system can be adapted to this
environment by making a large buffer available in main
memory, it is possible to achieve significant gains in
performance and response time by designing a database
system tuned to this environment. In this paper, our focus
is on the storage manager component of a database system. This component provides basic database functions such as disk and memory management, transactions (concurrency and recovery), and data access paths.

*Prepared through collaborative participation in the Advanced Telecommunications & Information Distribution Research Program (ATIRP) Consortium sponsored by the U.S. Army Research Laboratory under the Federated Laboratory Program, Cooperative Agreement DAAL01-96-2-0002. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. Copyright ©1998 Lucent Technologies, Inc.
†Contact author.
‡Current address: Department of Computer Science, IIT Bombay.
**Lockheed Martin was a Beta test site for the DataBlitz tool kit.
1. INTRODUCTION
Military applications that require real-time response to a
dynamically changing external environment cannot usually
tolerate the unpredictable delays associated with
commercial database systems. While a high degree of disk
parallelism can be used to achieve high rates of
throughput, disk-based systems cannot achieve predictable
response times in the tens of milliseconds. Main memory
is the only technology capable of these characteristics. In
the past, the high cost and size limitations of main memory
led to the development of specialized memory and data
management systems which can be very costly to maintain
and difficult to extend. Furthermore, many military
applications operate in embedded environments (e.g.,
satellites, fighter jets, ships) that run real-time operating
systems and favor light-weight configurable database
systems with small footprint sizes.
2. BACKGROUND
While there has been much work on storage managers,
including Exodus [CDRS89] and Starburst [HCL+90],
there has been little work on main memory storage
managers until recently (System M [SGM90] is a transaction processing test-bed for memory-resident data, but is not a full-feature storage manager). Within the past several months, substantial interest in main-memory storage managers has emerged, as evidenced by such commercial systems as TimesTen (based on the Hewlett-Packard Smallbase project), Angara, and the DataBlitz™ system. The current practice in main-memory systems is to code special-purpose, custom systems. This practice arises from the lack (until recently) of viable commercial alternatives, and from practical cost constraints mandating a minimalist approach to the use of hardware resources.
We have seen increasing interest in employing general-purpose storage managers in place of existing legacy systems. There are also numerous efforts in the military to create sharable data models, such as the Army's Common Database (ACDB) program, which encourage the use of general-purpose storage managers. However, the lack of predictable performance in existing database systems has led to the development of special-purpose filter and replication mechanisms.

DataBlitz is a storage manager for persistent data whose architecture has been optimized for environments where the database is main-memory resident. A number of principles have evolved with DataBlitz. The first principle is direct access to data. DataBlitz uses a memory-mapped architecture, where the database is mapped into the virtual address space of the process, allowing the user to acquire pointers directly to information stored in the database. A related principle is no interprocess communication for basic system services. All concurrency control and logging services are provided via shared memory rather than communication with a server. These principles contribute significantly to improving performance by eliminating from the execution path of data accesses (1) expensive remote procedure calls, and (2) costly buffer-manager processing overhead.

Another key requirement for applications that expect to store all their data in main memory is consistency of response time. As pointed out in [GL92], once the I/O bottlenecks of paging data in and out of the database are removed, other factors such as latching and locking dominate the cost of a database access. Since operating system semaphores are too slow due to the overhead of making system calls, DataBlitz implements its own latches in user space for speed. In order to ensure predictability of response times for user applications and their scalability in symmetric multiprocessor environments, DataBlitz provides support for fine-grained concurrency control at both the lock level (e.g., record-level locking) and the latch level (e.g., for protecting system structures like the lock table), and fuzzy checkpoints that minimally interfere with transaction processing.

The next guiding principle of DataBlitz is enabling the creation of fault-tolerant applications. DataBlitz provides a multi-level transaction model that facilitates the production of high-concurrency indexing and storage structures. The DataBlitz system also includes support for recovery from process failure in addition to system failure, and the use of codewords and memory protection to ensure the integrity of data stored in shared memory.

Other principles that have guided DataBlitz's implementation are a toolkit approach and support for multiple interface levels. The former implies, for example, that logging facilities can be turned off for data that need not be persistent, and locking can be turned off if data is private to a process. The latter principle means that low-level index components are exposed to the user so that critical application components can be optimized with special implementations. Although application developers may prefer the high-level relational interface for its ease of use, our experiments indicate that low-level interfaces can provide substantial performance gains. Also, exporting the low-level interfaces enables DataBlitz to be customized to meet specific application needs such as small executable size.

DataBlitz has been in use at several beta sites, both within and outside Lucent Technologies. It is the storage management component of the High Performance Data Access (HPDA) testbed for military applications at Lockheed Martin Advanced Technology Laboratories. The Advanced Telecommunications/Information Distribution Research Program is using the HPDA framework for experimenting with quality-of-service mechanisms that utilize network performance feedback to improve bandwidth allocation. Initial experiments with DataBlitz indicate that it is highly reliable, easily integrated with other components, and that it provides developers with a range of design and implementation choices not available in other commercial database products.

3. ARCHITECTURE
In the DataBlitz architecture, the database consists of one or more database files, along with a special system database file. User data is stored in database files, while all data related to database support, such as log and lock data, is stored in the system database file. This enables storage allocation routines to be used uniformly for (persistent) user data as well as (non-persistent) system data like locks and logs. The system database file also persistently stores information about the database files in the system.
As shown in Figure 1, database files opened by a process
are directly mapped into the address space of that process.
In DataBlitz, either memory-mapped files or shared-memory segments can be used to provide this mapping.
Memory mapping precludes using virtual memory addresses as physical pointers to data (in database files), but allows database files to be resized easily.

[Figure 1. Architecture of the DataBlitz System. The system database (locks and logs) and the database files (DB File 1 through DB File N) reside in shared memory, backed by checkpoints and logs on disk; each user process links the DataBlitz library and maps the open database files into its virtual memory.]

3.1 LAYERS OF ABSTRACTION
DataBlitz's architecture is organized in multiple layers of abstraction to support the toolkit approach. Figure 2 illustrates this architecture. At the highest level, users can interact with the DataBlitz relational manager. The DataBlitz relational manager is a C++ class library interface to access data stored in relational tables. It provides support for table scans (via iterators) and simple project-select joins. Below that level is the "heap-file/indexing layer," which provides support for fixed-length and variable-length collections, as well as template-based indexing abstractions. Services for logging, locking, latching, multi-level recovery, and storage allocation are exposed at the lowest level. New indexing methods can be built on this layer, as can special-purpose data structures for either an application or a database management system.

[Figure 2. Layers of Abstraction in DataBlitz. From top to bottom: Applications; Relational Manager; Heap File/Indexing; Logging, Locking, Allocation, etc.]

4. TRANSACTION MANAGEMENT IN DATABLITZ
Transaction management in DataBlitz is based on principles of multi-level recovery [MHL+92, Lom92]. To our knowledge, DataBlitz is the only implementation of multi-level recovery for main memory, and one of the few implementations of explicit multi-level recovery reported to date.

5. EXPERIMENTAL DESIGN
We used a real-world database containing information relating to telephone customers. There are approximately 250,000 records, each of size 200 bytes. Thus, the size of the database is approximately 50 MB. Each record has 17 attributes, of which three are numeric and the remaining are character strings (e.g., name, address). Our transactions consisted of a set of read and/or update operations. A read operation consists of looking up a customer record based on a given telephone number. An update operation consists of the same lookup used for a read, followed by a change to the customer's zip code. The experiments were performed on a 200 MHz Sun UltraSPARC with 1 GB of RAM, running Solaris 2.5.

DataBlitz has a number of user-tunable parameters. For our experiments, we had two additional servers running in the background which typical real-world applications would require: one was a checkpoint server, which performs a fuzzy checkpoint every 10 seconds, and the other a cleanup server, which checks for application process crashes. In our experiments, we found that the servers interfere minimally with user transactions.

The experiments were conducted using the C++ template-based hash index and the relational manager. In addition, we also compared DataBlitz's performance with that of a commercial disk-based DBMS. The buffer cache size parameter for the server was set to be large enough that the entire database fit in the server's buffer. Also, prior to performing experiments on the commercial DBMS, we loaded the entire database into the server's buffer cache.

5.1 THROUGHPUT AS A FUNCTION OF OPERATIONS PER TRANSACTION
Read-only transactions. We begin by considering an offered load consisting solely of read-only transactions. The number of operations per transaction is varied from 10 to 500. Since transaction processing consists of the actual execution of the read operations plus the commit overhead for each transaction, we expect higher throughput for larger transaction sizes. The data in Figure 3(a) supports this. For sufficiently large transactions, the commit overhead is an insignificant fraction of overall execution time and thus the throughput curve levels off. The figure shows that the low-level hash-index interface offers roughly a factor of 3.5 gain in throughput. This is because the relational interface involves additional processing for checking attribute types, enforcing null semantics for attributes, and copying key values. In a main-memory setting with no disk I/O, these overheads constitute a fairly large percentage of the work involved in performing a lookup. As a result, the low-level interfaces can make a significant difference, and thus it is imperative to make low-level interfaces available to those applications that have the greatest throughput needs.

[Figure 3. Throughput as a Function of Operations per Transaction, for (a) read-only transactions and (b) update-only transactions. Operations/sec is plotted against operations per transaction (up to 500) for the hash index, the relational interface, and the commercial DBMS.]

The lookup performance of DataBlitz at the relational interface is more than 10 times that of the commercial DBMS; at lower levels, DataBlitz achieves almost 35 times the throughput. These spectacular performance gains over the commercial DBMS can be attributed primarily to the following three factors: (1) lookups in DataBlitz do not have to interact with the buffer manager; (2) in DataBlitz, applications access data directly without any inter-process communication (in contrast, applications for the commercial DBMS have to communicate with a server); and (3) in DataBlitz, applications do not incur SQL-related overhead.

Update-only transactions. Figure 3(b) shows the same data for update-only transactions. Although the slope of the curve decreases as transaction size increases, the effect is much less than for the read-only case. Here, the amount of work to be done at commit time is larger, since log records for the updates must be written to disk. This disk I/O is the dominant cost. Larger transactions allow for more log records to be transferred in a single disk write. Thus, since transfers to disk with bigger granularity can be performed more efficiently, there is a greater benefit to large transaction sizes in the update-only case as compared to the read-only case described earlier.

Once again, we see that the low-level hash-index interface offers greater throughput, due to the lower processing overhead and smaller volume of log records. Further, because of the dominance of disk I/O cost at the lower level, the benefits of the low-level interface manifest themselves to a greater degree as the disk overhead per operation decreases, that is, as the transaction size increases. The difference is roughly a factor of 3 for transactions consisting of 500 update operations.

Contrary to what one would expect, DataBlitz also outperforms the commercial DBMS for update-only transactions. This is surprising since, for update transactions, both DataBlitz and the commercial DBMS have to flush the log to disk at transaction commit time. Thus, one would expect disk I/O overhead to dominate transaction processing costs, and these costs to be identical for both DataBlitz and the commercial DBMS. However, DataBlitz's throughput at the relational interface is about 4 and 9 times that of the commercial DBMS at 100 and 500 operations per transaction, respectively. With DataBlitz's hash index, we obtain even higher multiples for the throughput numbers.

It is worth noting that while DataBlitz's performance improves as the number of operations per transaction is increased, the commercial DBMS's performance stays essentially the same. Our conjecture is that each update operation involves a remote procedure call to the server, and that the server flushes the log records for each of the transaction's operations independently, rather than batching them together in a single flush at transaction commit.
REFERENCES
[CDRS89] M. J. Carey, D. J. DeWitt, J. E. Richardson, and E. J. Shekita. Storage management for objects in EXODUS. In W. Kim and F. H. Lochovsky, editors, Object-Oriented Concepts and Databases. Addison-Wesley, 1989.
[GL92] Vibby Gottemukkala and Tobin Lehman. Locking and latching in a memory-resident database system. In Proc. of the Int'l Conf. on Very Large Databases, pages 533-544, August 1992.
[HCL+90] L. M. Haas, W. Chang, G. M. Lohman, J. McPherson, P. F. Wilms, G. Lapis, B. Lindsay, H. Pirahesh, M. Carey, and E. Shekita. Starburst mid-flight: As the dust clears. IEEE Transactions on Knowledge and Data Engineering, 2(1), March 1990.
[Lom92] D. Lomet. MLR: A recovery method for multi-level systems. In Proc. of ACM-SIGMOD Int'l Conference on Management of Data, pages 185-194, 1992.
[MHL+92] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems, 17(1):94-162, March 1992.
[SGM90] K. Salem and H. Garcia-Molina. System M: A transaction processing testbed for memory resident data. IEEE Transactions on Knowledge and Data Engineering, 2(1):161-172, March 1990.
ACKNOWLEDGMENTS
DataBlitz represents a substantial effort over several years,
to which many have contributed. We would like to thank
H.V. Jagadish for significant early contributions to
DataBlitz. Dan Lieuwen also provided significant early
contributions, and has made several valuable suggestions
concerning process failure and the organization of the heap
file. S. Seshadri contributed to the T-tree concurrency
algorithm and the relational interface. Steve Coomer
suggested the strategy of using codewords to protect
checkpoint images on disk. Dennis Leinbaugh suggested
the structure of the coalescing allocator. We would also
like to thank the following talented individuals who have
contributed to the design and implementation of specific
systems in Dali and/or DataBlitz: Yuri Breitbart, Soumya
Chakraborty, Ajay Deshpande, Sadanand Cogate, Chandra
Gupta, Sandeep Joshi, Peter McIlroy, Sekhara Muddana,
Mike Nemeth, John Miller, James Parker, P.P.S. Narayan
and Yogesh Wagle.
*The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government.