Exploiting parallelism for the performance enhancement of
non-numeric applications
by DAVID J. DEWITT and DINA FRIEDLAND
University of Wisconsin
Madison, Wisconsin
ABSTRACT
In this paper we examine the design of computer architectures that use parallelism
to enhance the performance of non-numeric applications. In particular we examine
how the technology of mass-storage devices has affected the design of computer
architectures for non-numeric processing.
INTRODUCTION
In this paper we examine the design of computer architectures
that use parallelism to enhance the performance of non-numeric applications. While this has been an active area of research for over ten years,1 no commercial products that
exploit parallelism have resulted from these efforts. This is
especially interesting when one considers the wide range of
products available commercially for numeric applications that
use parallelism. At the top-end (in terms of both performance
and price) are large scientific processors such as the Cray-1.
At the bottom end (in terms of price) are products such as the
Floating Point Systems attached pipelined processor.
Why is this the case? There seem to be several possible
explanations. The first is that the problems associated with
parallelism for non-numeric computation have not received
the same amount of attention as the numeric area. Considering the large number of papers on "database machines," this
is not a very plausible explanation. There are, however, two
plausible ones. The first is that while there has been a constant
demand for "more cycles" for numeric computations from a
variety of user groups (e.g., geologists at oil companies doing
seismic data analysis, physicists at Lawrence Livermore Laboratory solving large sets of simultaneous differential equations), a similar market has never developed for non-numeric
applications. Another feasible explanation is that the recent
advances in technology have not really helped make highly
parallel non-numeric processors economically viable.
In this paper we intend to present and discuss a number of
alternative architectures that have been proposed for non-numeric computation. While we will not present a performance evaluation of these architectures, we have performed a number of such studies recently.2,3,4,5 Instead, we intend to summarize the results from these studies in describing the problems associated with the architectures that have been proposed. We begin by characterizing those operations that must be efficiently supported by a non-numeric processor (regardless of whether parallelism is used). The impact of limited I/O bandwidth on parallel architectures for non-numeric processing is discussed in the third section. In the fourth and fifth sections, we describe a number of architectures that have been proposed to increase the total I/O
bandwidth in the system. A class of architectures referred to
as "on-the-disk" machines is described first. Processing
"complex" non-numeric operations such as sorting requires a
different type of machine. We discuss several alternative organizations for machines of this type in the fifth section. Finally,
we present our conclusions and suggestions for future
research.
OPERATIONS OF NON-NUMERIC PROCESSING
Non-numeric databases can be naturally divided into two
classes: (1) those that contain unformatted documents such as
law cases and (2) those that contain formatted data such as
inventory records. In this section we will describe the types of
operations that must be supported by non-numeric processors
for databases of both types.
Unformatted data-text retrieval systems
Processing unformatted data is both simple and complex. It
is simple in the sense that what is required is to make a
sequential pass through a database searching for those records
(documents) that have a certain property. For example, in an
online medical database containing all research reports dealing with cancer, a researcher might ask a text retrieval system
to retrieve all documents that mention the word "liver."
Processing queries can become significantly more complex,
however. Consider the query "retrieve those documents that
mention the liver and contain a reference to either the chemical 'carbon tetrachloride' or 'perchloroethylene' 0" Processing
this query is difficult because the reference to one of these
chemicals might appear either before or after the reference to
the liver and, in general, the references may be separated by
an arbitrary amount of text. Furthermore, the system must be
aware of a number of different delimiters (e.g., paragraphs,
chapters, etc.), so that it does not retrieve a document when
the query is satisfied by strings outside of the appropriate
context.
The number of documents that must be searched to process
a query can be reduced by maintaining secondary indices6 on
certain key words or phrases such as "organ type." However,
simply storing these indices can become prohibitive when indices must be maintained for a large number of different
keywords. Unless users access the database using only a limited number of key words (i.e., organ names or chemical
names), secondary indices are not viable. Thus, since processing a query involves primarily a linear scan of the database, application of parallel processing techniques appears to
be a very promising approach to providing fast access to very
large unformatted databases. In the section on search machines, we will describe several architectures that are well
suited for processing unformatted queries.
Formatted data-database systems
While a number of different data models have been proposed for describing the logical organization of a formatted
database,6 most researchers who have designed special purpose architectures for enhancing database system performance have assumed the relational data model. This has
occurred primarily because of the regularity of the relational
data model for specifying information on both entities (e.g.,
parts, suppliers) and relationships between entities (e.g.,
which parts a supplier supplies).
A relational database7 consists of a number of normalized
relations. Each relation is characterized by a fixed number of
attributes and contains an arbitrary number of unique tuples.
Thus, a relation can be viewed as a two-dimensional table in
which the attributes are the columns and the tuples are the
rows. In a relational DBMS, relations are used to describe
both entities and relationships between entities. Figure 1
shows a simple database that describes information about suppliers (relation S), parts (relation P), and the association
between suppliers and the parts they supply (relation SP).
The operations supported by most relational database systems can be divided into two classes according to the time
complexity of the algorithms used on a uni-processor system.
The first class includes those operations that reference a single
relation and require linear time (i.e., they can be processed in
a single pass over the relation). The most familiar example is
the selection operation which selects those tuples from a relation that satisfy a simple predicate (e.g., suppliers in "New York").
The second class contains operations that have either one or
two input relations and require nonlinear time for their execution. An example of an operation in this class that references
one relation is the projection operation. Projecting a relation
involves first eliminating one or more attributes (columns) of
the relation and then eliminating any duplicate tuples that
may have been introduced by the first step. Sorting [which
requires O(n log n) time] is the generally accepted way of
eliminating the duplicate tuples. The join operation is the
most frequently used operation from this class that references
two relations. This operation can be viewed as a restricted
cross-product of two relations. A join would be used by a user
to find the name and address of all suppliers who supply a
particular part (see Figure 1).
S RELATION
  Si   SUPPLIER-NAME   ADDRESS
  17   JONES           MADISON

P RELATION
  Pi   PART-NAME   COLOR    WEIGHT
  36   STOVE       YELLOW   300

SP RELATION
  Si   Pi   QUANTITY-ON-HAND
  17   36   13

Figure 1-Supplier-parts relational database

The problem of size
The very size of both formatted and unformatted data sets
introduces two important problems. First, operations on both
types of database are almost always I/O and not CPU intensive. For certain complex search conditions on both unformatted and formatted databases, a rather fast CPU may be
required to process the query at the speed of the mass-storage
device. However, as we will discuss in the next section, during the last ten years improvements in technology have had a much more dramatic effect on the performance of low-cost processing units (e.g., the Motorola 68000) than on the bandwidth of I/O
devices. We contend that even this type of query should be
considered to be I/O limited rather than CPU limited.
The second effect of very large data sets is that the algorithms used must be "external" algorithms. For example, in
designing a machine that can rapidly sort very large data files,
one must never make the assumption that enough main
memory will be available to permit the whole file to be
brought into main memory and sorted. For a single processor,
the algorithm normally used is an external merge sort.8 It is
important to realize that the requirement that external algorithms be used has the side effect of increasing I/O activity.
Later we will examine several architectures that employ parallelism for processing complex operations in order to minimize
the number of I/O operations performed.
The algorithms for processing selection operations on formatted or unformatted databases are naturally external algorithms as the normal mode of execution is to read the next
block of data from mass storage and then apply the search
condition to it. Since these queries can always be processed in
one pass through the database, they are, when measured in
terms of I/O activity, significantly simpler than sorting a very
large data file. Thus a different class of architectures that exploit parallelism is appropriate. These architectures will
be discussed in the fourth section.
THE PROBLEM OF I/O BANDWIDTH
The key factor in limiting development of parallel processors
for non-numeric computation is the lack of sufficient I/O bandwidth from commercially available mass-storage devices.
Consider, for example, a typical (for 1981) mass-storage device such as the IBM 3350. Each track on this device holds
19,069 bytes, and a revolution takes 16.7 ms. Thus, the maximum burst bandwidth of the device is 1.144 Mbytes/second.
This represents an improvement of only 46% in I/O bandwidth
when compared with the IBM 3330-a drive that was introduced over ten years ago.
As was discussed briefly in the previous section, executing
a sequential search on a formatted or unformatted database is
more likely to be CPU limited than executing an external
algorithm for a complex operation (e.g., a join) on a formatted database, since the first task requires only N I/O operations (one for each of the N blocks of data) while the latter
requires, in general, on the order of N log N I/O operations. It
is thus most appropriate to examine the level of CPU performance necessary to process data at the rate of 1.144
Mbytes/second.
At a data rate of 1.144 Mbytes/second, a processor that is
directly attached to a disk controller has approximately 0.87
microseconds to process each incoming byte. If one assumes
that it takes two instructions to process each byte (one instruction both to compare the next byte of the incoming data
stream with the next byte of the search pattern and to do an
autoincrement of both pointers, and a second instruction to
test for both loop termination and branch on failure), then the
processor must be approximately a 2.3 MIP processor. While
this is significantly faster than a conventional microprocessor
such as a Motorola 68000 (which is about a 1 MIP processor),
it is certainly within the range of a simple processor constructed using bipolar bit-slice technology. If one insists on
using "off the shelf" hardware, one might use three 68000-type
processors each with an associated track-long buffer to process the incoming stream of data. Since each processor would
now have to process only every third track, the resulting performance should be more than adequate.
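As a rough check on this arithmetic, the short Python sketch below (purely illustrative; the figures are the ones quoted above) recomputes the per-byte time budget and the implied instruction rate.

    # Back-of-the-envelope check of the instruction rate implied by the figures in the text.
    data_rate = 1.144e6                    # bytes/second, IBM 3350 burst rate
    instructions_per_byte = 2              # compare + autoincrement, then test + branch
    time_per_byte_us = 1e6 / data_rate     # microseconds available per incoming byte
    required_mips = instructions_per_byte * data_rate / 1e6
    print(f"{time_per_byte_us:.2f} us per byte, {required_mips:.2f} MIPS required")
    # Prints roughly 0.87 us per byte and about 2.3 MIPS, matching the estimate above.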
For formatted databases, not every byte of the incoming data stream needs to be examined, since the fields to be examined are located at known offsets in the incoming records. Thus, a single processor with two track-long buffers would be able to process selection queries at the
data rate of an IBM 3350 disk. While the disk is filling one
buffer, the processor can be applying the selection criterion to
the records in the other buffer. Processing a record involves
simply indexing into the record (by adding the offset of the
field in the record) and then applying the selection condition
to the field. As long as the number of bytes to be processed
does not exceed approximately one-third of the record, one
processor should be able to process the records as fast as they
are read off the disk.
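A minimal sketch of this double-buffered, offset-based selection follows; the record length, field offset, and key are hypothetical, and a real controller would overlap the disk transfer into one buffer with the scan of the other rather than alternate sequentially as shown.

    # Illustrative track-buffered selection on fixed-length formatted records.
    # RECORD_LEN, FIELD_OFFSET, and FIELD_LEN are assumptions made for the sketch.
    RECORD_LEN = 100
    FIELD_OFFSET, FIELD_LEN = 20, 8

    def select_from_buffer(track_buffer: bytes, key: bytes) -> list:
        """Return the records in one track-long buffer whose field equals key."""
        selected = []
        for start in range(0, len(track_buffer), RECORD_LEN):
            record = track_buffer[start:start + RECORD_LEN]
            # Index directly into the record; only FIELD_LEN of its bytes are examined.
            if record[FIELD_OFFSET:FIELD_OFFSET + FIELD_LEN] == key:
                selected.append(record)
        return selected

    def scan_tracks(tracks, key):
        # Each element of 'tracks' plays the role of a buffer the disk has just filled;
        # while one buffer is scanned, the disk would be filling the other.
        results = []
        for buffer in tracks:
            results.extend(select_from_buffer(buffer, key))
        return results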
The point that we are trying to make with these examples
is that one needs, at most, two to three processors and not tens
or hundreds to process data at the rate of present conventional
disk drives. Thus, an architecture that blindly uses a lot of
processors (no matter how they are interconnected) to process
data that resides on a few standard disk drives will inevitably be
I/O bound. Advances in disk technology hold only limited
hope for a resolution of this problem. Over the past ten years,
increased disk capacities have been mainly the result of an
increased number of tracks per surface, rather than an increased number of bytes per track. It is only with the recent
announcement of the IBM 3380 disk drive that this situation
has changed significantly. The 3380, with a revolution time of
16.7 ms. and a track capacity of 47,476 bytes, has a burst data
rate (assuming no seeks) of 2.85 Mbytes/second. While this is
a significant improvement, it still limits the number of processors that can be effectively used to the range 6-8.
In the next two sections, we will examine several architectures that have been proposed for increasing the effective
I/O bandwidth. In particular, we show that the architectural
solutions for processing selection operations are different
than those for processing complex operations on formatted
databases.
SEARCH MACHINES-"ON-THE-DISK" MACHINES
Architectures for parallel processing of selection operations
on formatted or unformatted data consist of two basic components: (1) a processing element and (2) a storage unit. A
number of different storage technologies can be used as the
basis of the storage unit, including bubble memories, charge-coupled devices, or conventional magnetic disks. Since the
logical organization of the processing elements is independent
of the storage medium used, we shall assume that the storage
unit is a track on a magnetic disk drive and that records are
stored bitwise along the track.
There are two radically different approaches for designing
the search engine. The first is to use a "conventional" processor such as a Motorola 68000 or a simple processor constructed using bit-slice components. The query to be executed
is first compiled (by some host processor) into the machine
language (or microcode) of the processor. Next the compiled
query is loaded into the memory of the processor and then
executed by applying the query to the storage unit associated
with the processing element.
An alternative approach is to construct a special purpose
processor9,10 that behaves like a finite state automaton (FSA).
In this approach the query to be processed is compiled into a
state-transition matrix that is then loaded into the memory of
the FSA. For each incoming byte, the FSA computes a next
state based on its current state and the value of the incoming
byte. Since processing any query requires only one transition
for each byte in the data stream, processing an arbitrarily
complex query can always be done at the speed of the data
stream. The main disadvantage of this approach is that the
state-transition matrix that represents the compiled query is
rather large [number of rows (states) * number of different
incoming symbols].
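To make the FSA idea concrete, the sketch below compiles a single literal pattern into a state-transition table (one row per state, one column per byte value) and then performs exactly one table lookup per incoming byte; the construction shown is a standard textbook automaton of our own choosing, not the compiled-query format of the cited hardware designs.

    # Illustrative FSA search: one state transition per incoming data byte.
    def build_transition_table(pattern: bytes):
        m = len(pattern)
        table = [[0] * 256 for _ in range(m + 1)]   # rows (states) * 256 incoming symbols
        table[0][pattern[0]] = 1
        fallback = 0                                # state reached on the longest proper border
        for state in range(1, m + 1):
            for byte in range(256):
                table[state][byte] = table[fallback][byte]
            if state < m:
                table[state][pattern[state]] = state + 1
                fallback = table[fallback][pattern[state]]
        return table

    def fsa_search(data: bytes, pattern: bytes) -> bool:
        table, state, accept = build_transition_table(pattern), 0, len(pattern)
        for byte in data:                           # exactly one transition per byte of the stream
            state = table[state][byte]
            if state == accept:
                return True
        return False

    print(fsa_search(b"damage to the liver was observed", b"liver"))   # True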
In the following sections we shall describe three ways of
organizing processing elements and storage units. For each
design, we have assumed that the processing elements are
connected to a host processor. This processor serves two important functions. First, it accepts and compiles queries from
the users of the system. Second, as we shall discuss below, it
can be used to assist in the execution of certain queries that
are too complex for the processing elements to handle alone.
Processor-per-track (PPT) machines
In the first class of search machines, each storage unit has
its own processing element associated with it as shown in
Figure 2. The processing element scans the data as the track
rotates and places selected records in an output buffer associated with the head. After a buffer fills, additional logic attempts to place its contents on the output bus for transmission
to the host. In the event that the processing logic is not able
to output a selected record (because the bus is busy and the
temporary storage buffers are full), processing is suspended
until at least one buffer is emptied.
Figure 2-PPT design (host processor connected to processing elements, each with an associated storage unit; together the storage units hold the database)
PPT machines were pioneered in 1970 by Slotnick,1 who suggested using a track of a fixed-head disk as the unit of storage. This class includes machines proposed by Parker,11 Minsky,12 and Parhami.13 Two early database machine designs, RAP14,15 and CASSM,16 also belong to this class. Two versions of RAP, each having two processing elements and two storage units, were constructed using first a fixed-head disk and then charge-coupled devices.
Because the entire database can be searched in one revolution of the storage units, this architecture is, at first glance,
very appealing. However, it has a number of serious flaws.
First, if magnetic disk technology is used as the basis of the
storage unit, an extremely large number of storage units and
cells are needed. Assume, for example, a state-of-the-art
track capacity of 50,000 bytes. To have the same storage capacity as a conventional 300 Mbyte disk drive (which can hold
only a very small database), 6,000 storage units and processing elements would be needed. The resulting system
would most likely be relatively unreliable. A second major
problem with this approach is that the bus connecting these
6000 processing elements to the host can easily become a
bottleneck. Assume that the bandwidth of the output bus is 10
Mbytes/second. If the storage units have a revolution time of
16.67 ms., the maximum aggregate bandwidth of the processing elements is 18 billion bytes/second. Even if only 0.1% of the data stored in each storage element satisfies the search condition, the processing elements will have an output bandwidth of 18 Mbytes/second and hence the bus will still be a bottleneck.
The PPT architecture appears to be feasible only if the capacity of each storage unit is much larger (with a proportionally longer revolution time). The consequences would be twofold. First, the number of units required to store a "large" database could probably be reduced to a "reasonable" level. Second, a reduction in the number of processing units would ease the output bus bottleneck problem. There are two techniques for increasing storage unit capacities. One is to use a different storage medium such as bubble memories (some of which have a capacity of one million bits each17). While the performance of such an approach is excellent,2 most manufacturers are dropping bubble memory components because their cost per bit has not become competitive. A second approach is to associate each processing element not with a single track of a disk, but rather with an entire surface of a disk. This approach is discussed in the following section.
Processor-per-head (PPH) machines
In the second class of search machines, each processing
element is associated with a head of a moving-head disk as
illustrated in Figure 3. Thus, the storage unit consists of all the
tracks on the surface of a disk instead of a single track. In this
class of machines, data is transferred in parallel over 1-bit wide data lines from the heads to the processing elements. Each processor applies the selection criteria to its incoming data stream and places selected records in its output buffer. In
such an organization, an entire cylinder of a moving-head disk
is examined in a single revolution (assuming no output bus
contention). As in PPT organizations, additional revolutions
may be needed to complete execution of the query if an output buffer overflows.
The DBC18 database machine project adopted the PPH
approach over the PPT approach as the basis for the design of
the "Mass Memory Unit" because PPT devices were not
deemed to be cost-effective for the storage of large databases
(say more than 10^10 bytes). Another possible reason for taking
this route is the apparent lack of success of head-per-track
disks as secondary storage devices. Moving-head disks with
parallel readout, on the other hand, seemed an attractive and
feasible alternative. The Technical University of Braunschweig, in cooperation with Siemens, has actually built one for use in the Braunschweig search machine SURE.19 It is the case, however, that parallel readout disks are presently not widely available (a 600-Mbyte drive with a 4-track parallel readout capability and a data transfer rate of 4.84 Mbytes/second is available for the Cray-1 for approximately $80,000 without controller) and may never be cost-effective for the storage of large databases (when compared to "standard" drives).

Figure 3-PPH design (control logic and host processor connected to processing elements, one per head of a moving-head disk)
Processor-per-disk (P P D) machines
Unlike the PPT and PPH approaches, the PPD organization utilizes a standard disk drive. In this organization, a processing element is placed between the disk and the memory
device to which the selected records are to be transferred as
shown in Figure 4. This processor acts as a filter9 to the disk
by forwarding to the host only those records that match the
selection criteria. At first glance, it seems as though this approach is so inferior to the others that it does not merit any
attention. However, its advantage is that for a relatively low
price one can obtain the same functionality (but not the same
performance) as the PPT and PPH designs. In addition, by
broadcasting the data stream from the disk to a set of processing elements,19 one can simultaneously process selection
queries from different users over the same database.

Figure 4-PPD design (a processing element interposed between the disk controller and the host processor)
Use of Search Machines for Complex Database Operations
Since the only functionality provided by the PPT, PPH, and
PPD designs is to process selection queries, each of these
designs processes complex operations (e.g., join) on formatted databases by decomposing the query using an algorithm based on Wong's tuple substitution algorithm.20 Assume that the join operation as specified by the user has the
form R.a = S.b and that S contains fewer tuples than R. These designs will process this query by issuing one selection subquery for each tuple in S. The form of each subquery will be R.a = x, where x is the join attribute value from the current tuple in S. The result relation is produced by having the host processor "join" each tuple in S with all the tuples from R returned by the execution of its subquery. The performance of this approach, however, is very poor.3 In fact, it is inferior to using only the host processor to perform a traditional "sort-merge" join. In the next section, we will describe several non-numeric architectures that are designed to process complex database operations effectively.
MACHINES FOR COMPLEX DATABASE OPERATIONS
Parallelism can also be employed to enhance the performance
of complex database operations. The use of parallel processors to enhance the performance of the relational join and
project operations is highly desirable, since both are very
time-consuming in a conventional database management
system.
As discussed by DeWitt and Hawthorn3 and Boral, DeWitt, Friedland, and Wilkinson,21 we strongly advocate an "algorithmic approach" to the design of architectures that use parallelism to enhance the performance of non-numeric operations. As an example of a complex operation, we present the relational join operator in the following section. Afterwards, we discuss four alternative building blocks that can be used as the basis for parallel algorithms for complex formatted database operations. Then we describe parallel architectures based on two of these building blocks. Because a performance analysis of these alternative algorithms and architectures is beyond the scope of this paper, the interested reader is encouraged to examine Boral, DeWitt, Friedland, and Wilkinson21 and Friedland.4
Unlike the selection operation, for which a partition of the
data can be searched independently by each processor, the
join, projection, and the other complex operations require
that the processors exchange data among themselves during
intermediate stages of execution. In addition, the volume of
I/O activity is significantly higher, since unlike the selection
operation, these operations cannot be performed in a single
pass over the operand relations. While the building blocks
vary in terms of the degree of interprocessor communication
and amount of I/O activity each requires, the architectural
features that are needed to efficiently execute these tasks
include a fast interprocessor communication facility and a
cost-effective, mass-storage device that provides high I/O
bandwidth.
The relational join operation
In Figure 5 the execution of the join operation is illustrated
by the "nested-loops" algorithm. Relations R and S are assumed to have N and M tuples respectively, and fields r of relation R and s of relation S are assumed to be of the same type. Execution of the join requires that the r field from each tuple of R be compared to the s field from each tuple of S. When they match, the two tuples are concatenated. In general, each tuple in R will match an arbitrary number of tuples in S. In the worst case, the join of R and S will produce N*M tuples.
for i <- 1 to N do
    for j <- 1 to M do
        if R_i.r = S_j.s then
            begin
                t = join(R_i, S_j);
                output(t)
            end;

Figure 5-The nested-loops join algorithm
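For readers who prefer running code, the sketch below is a direct Python rendering of Figure 5; relations are represented as lists of dictionaries, and the field names in the example are hypothetical labels taken from Figure 1.

    # Python rendering of the nested-loops join of Figure 5 (illustrative data layout).
    def nested_loops_join(R, S, r_field, s_field):
        result = []
        for r_tuple in R:                        # outer loop over the N tuples of R
            for s_tuple in S:                    # inner loop over the M tuples of S
                if r_tuple[r_field] == s_tuple[s_field]:
                    result.append({**r_tuple, **s_tuple})   # concatenate matching tuples
        return result

    # Example drawn loosely from the supplier-parts database of Figure 1.
    S_rel = [{"Si": 17, "SUPPLIER_NAME": "JONES", "ADDRESS": "MADISON"}]
    SP_rel = [{"Si": 17, "Pi": 36, "QUANTITY_ON_HAND": 13}]
    print(nested_loops_join(S_rel, SP_rel, "Si", "Si"))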
Building blocks for complex database operations
While the nested-loops join algorithm is adequate for "small" relations, it is a very slow method for joining two "large" relations. If indices exist on the join attributes for both relations, then the join can instead be performed on the index entries (instead of the tuples themselves).22,23 This procedure substantially reduces the volume of I/O activity, since the index files are much smaller than the data files. If indices are not
available, the join of two "large" relations is usually performed by presorting the operand relations. This process is
followed by a modified merge phase in which matching tuples
are located and joined. After the relations have been sorted,
the actual join step requires only a linear pass through both
relations. Sorting is also used by conventional database management systems to efficiently perform the duplicate elimination part in the projection. Alternatively, hashing24 can be
employed to eliminate duplicate tuples in a projected relation.
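As a small illustration of the hashing alternative, the sketch below projects a relation onto a subset of its attributes and uses a hashed set to eliminate the duplicates introduced by dropping columns; the attribute names are hypothetical.

    # Illustrative projection with hash-based duplicate elimination.
    def project(relation, attributes):
        seen, result = set(), []
        for tup in relation:
            key = tuple(tup[a] for a in attributes)   # keep only the projected columns
            if key not in seen:                       # hash lookup instead of an O(n log n) sort
                seen.add(key)
                result.append(dict(zip(attributes, key)))
        return result

    SP_rel = [{"Si": 17, "Pi": 36, "QUANTITY_ON_HAND": 13},
              {"Si": 17, "Pi": 38, "QUANTITY_ON_HAND": 5}]
    print(project(SP_rel, ["Si"]))    # [{'Si': 17}]  -- the duplicate supplier number is removed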
There appear to be four building blocks that can be used as
the basis for parallel algorithms (and parallel machines to
execute these algorithms) for complex database operations.
They are
1. Indexing
2. Hashing
3. Sorting
4. Broadcasting
Of these four techniques, sorting and broadcasting appear to
us to be the most promising. In the following paragraphs we
will describe a number of unresolved problems associated
with using indices and hashing as basic building blocks. In
both cases we shall use the join operation as a vehicle for
explaining our objections.
The basic idea of using indexing25 in a parallel algorithm for
the join operation is to have both the indices and the corresponding relations uniformly partitioned among the processors, each of which has a mass-storage device associated
with it. During execution of the algorithm, each processor
uses its portion of the indices to compute its portion of the
result relation. The fundamental flaw of this approach is that
it assumes that if you uniformly distribute the tuples from both
relations among all the processors, the two tuples from R and
S with the same attribute value will end up on the same
processor.
Goodman and Despain25 also propose using hashing as the
basis for a parallel join algorithm. Again the two relations are
assumed initially to be uniformly distributed among all the
processors. The algorithm begins by applying a hash function
to the "joining" attribute of both relations. This hash function
is used to compute a processor number. After the appropriate
processor has been identified, the tuple is sent to the processor to be joined with the tuples with the same attribute
value. During the second phase of the algorithm, each processor joins the tuples it receives as the result of the first
phase.
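A sequential sketch of this two-phase algorithm follows; the number of processors, the hash function, and the per-bucket join method (nested loops, for brevity) are illustrative choices of ours rather than details of the cited proposal.

    # Illustrative hash-partitioned join, simulated sequentially.
    def hash_partitioned_join(R, S, r_attr, s_attr, num_processors=4):
        # Phase 1: route each tuple to the "processor" selected by hashing its join attribute.
        r_buckets = [[] for _ in range(num_processors)]
        s_buckets = [[] for _ in range(num_processors)]
        for t in R:
            r_buckets[hash(t[r_attr]) % num_processors].append(t)
        for t in S:
            s_buckets[hash(t[s_attr]) % num_processors].append(t)

        # Phase 2: each processor joins the tuples it received (here, one bucket after another).
        result = []
        for p in range(num_processors):
            for rt in r_buckets[p]:
                for st in s_buckets[p]:
                    if rt[r_attr] == st[s_attr]:
                        result.append({**rt, **st})
        return result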
There are a number of problems associated with this algorithm. First note that the number of distinct join attribute
values will, in general, be much larger than the number of
processors available. Thus, each processor will receive tuples
with a range of join attribute values. As a consequence, the
processor must somehow (e.g., by sorting) determine exactly
to which tuples every tuple is to be joined. A second problem
is the cost of moving almost every tuple to a different processor from where it originally resided. While both of these
problems have solutions, a serious problem that is impossible
to resolve remains. The idea of using a hash function in the
first place was to distribute the tuples in a manner that would
permit all processors to help perform the actual join operation. However, when there are only a few join attribute
values, only a few processors will be used to actually perform
the join. In general, since the distribution of join attribute
values is unknown, the performance of a machine designed
around this algorithm will be unpredictable (an unsettling
fact).
An architecture based on parallel sorting
Until recently,4 parallel sorting had not been suggested or
investigated as the basis for performing complex database
operations. This can be partly explained by the lack of research in the parallel external sorting area. While extensive
literature on parallel sorting exists, no algorithms had been
developed for sorting large mass-storage files in parallel. When one considers how successful sorting is for performing complex operations on a single processor, it is natural also to examine its use for parallel architectures.
Friedland4 has examined and analyzed a number of external
parallel sorting algorithms and architectures. The algorithms
that display the best performance are based on either a two-way external merge step26 or an extension of Batcher's bitonic sort27 to permit the sorting of external data files. In the general case, the algorithm based on the two-way external merge step,
termed the parallel binary merge algorithm, has the best performance. In this section we will describe the operation of this
algorithm on an architecture that permits the efficient execution of the algorithm. Use of this algorithm for performing
the join would occur in two steps. First, both relations would
be sorted on the joining attribute. Next, one processor would
be used to join the two relations by reading the sorted files in
a sequential pass through them. Tuples from the two relations
with equal joining attribute values would then be joined.
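The final merge pass described above might look like the sketch below, which assumes both inputs are already sorted on the join attribute and handles runs of equal key values on either side; the function and attribute names are ours.

    # Illustrative merge step of a sort-merge join over two sorted inputs.
    def merge_join(R_sorted, S_sorted, r_attr, s_attr):
        result, i, j = [], 0, 0
        while i < len(R_sorted) and j < len(S_sorted):
            rv, sv = R_sorted[i][r_attr], S_sorted[j][s_attr]
            if rv < sv:
                i += 1
            elif rv > sv:
                j += 1
            else:
                # Find the group of S tuples sharing this key, then join every
                # R tuple carrying the same key against the whole group.
                j_end = j
                while j_end < len(S_sorted) and S_sorted[j_end][s_attr] == rv:
                    j_end += 1
                while i < len(R_sorted) and R_sorted[i][r_attr] == rv:
                    for st in S_sorted[j:j_end]:
                        result.append({**R_sorted[i], **st})
                    i += 1
                j = j_end
        return result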
Execution of the parallel binary merge algorithm is divided
into three stages as shown in Figure 6. We assume that the
number of pages to be sorted, N, is at least twice the number
of processors, P. The algorithm begins execution in a suboptimal stage in which sorting is done by successively merging
pairs of longer and longer runs until the number of runs is
Figure 6-Parallel binary merge with 4 processors and 16 pages (suboptimal, optimal, and postoptimal stages between an input drive and an output drive)
Figure 7-Binary tree interconnection between the processors
equal to twice the number of processors. During the suboptimal stage, the processors operate in parallel but on separate data. First, each of the P processors reads two pages and
merges them into a sorted run of two pages. This step is
repeated until all single pages have been read. If the number
of runs of two pages is greater than 2*P, each of the P processors proceeds to the second phase of the suboptimal stage
in which it repeatedly merges two runs of two pages into
sorted runs of four pages until all runs of two pages have been
processed. This process continues with longer and longer runs
until the number of runs equals 2*P.
When the number of runs equals 2*P, each processor will merge exactly two runs of length N/2P. We term this phase
the optimal stage. At the beginning of the postoptimal stage,
the controller releases only one processor and logically
arranges the remainder as a binary tree (see Figure 6). During
the postoptimal stage, parallelism is employed in two ways.
First, all processors at the same level of the tree (Figure 6)
execute concurrently. Second, pipelining is used between levels. By pipelining data between levels of the tree, a parent is
able to start its execution a single time unit after both its
children (i.e., as soon as its children have produced one output page).
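The run-merging pattern of the three stages can be simulated sequentially as in the sketch below; pages are plain Python lists, the processors and the pipelining between tree levels are not modeled, and the page contents are arbitrary examples.

    # Sequential simulation of the pairwise run merging; in the machine each pass
    # would be spread over the P processors, and the final passes would be pipelined
    # up a binary tree of processors (Figure 6).
    from heapq import merge

    def binary_merge_sort(pages):
        runs = [sorted(page) for page in pages]      # every page becomes an initial sorted run
        while len(runs) > 1:
            next_runs = []
            for k in range(0, len(runs), 2):         # merge runs two at a time
                if k + 1 < len(runs):
                    next_runs.append(list(merge(runs[k], runs[k + 1])))
                else:
                    next_runs.append(runs[k])        # odd run is carried into the next pass
            runs = next_runs
        return runs[0] if runs else []

    pages = [[9, 3], [7, 1], [4, 8], [2, 6], [12, 5], [11, 0], [10, 14], [13, 15]]
    print(binary_merge_sort(pages))                  # the 16 keys in sorted order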
The ideal architecture for the execution of this algorithm is
a binary tree interconnection between the processors as
shown in Figure 7. The mass-storage device consists of two
disk drives, and each leaf processor is associated with a surface on both drives as in the PPH organization. This organization permits the leaf processors to do I/O in parallel while
reducing by almost half the number of processors that actually
must do input/output. Analysis of the performance of this
architecture shows substantial performance improvements
(over a single processor) when such a mass-storage device is
available. If, however, only two conventional disk drives are
available, the performance improvement achieved is very limited, which indicates again the need for more I/O bandwidth if
parallel architectures for non-numeric applications are to be
viable.
An architecture based on broadcasting
By using broadcasting as a basic building block, a number
of very simple and relatively fast parallel algorithms for complex database operations can be constructed. As an example,
consider the nested-loops join algorithm shown in Figure 5
and assume that N and M refer to the number of disk pages occupied by relations R and S, respectively, instead of the
number of tuples. Also assume (for the moment) that N processors are available. A parallel nested-loops join algorithm
would begin by having each of the N processors read one page
from relation R. Next, the M pages of relation S are broadcast, one at a time, to all the processors. Upon receipt of a
page of S, each processor will apply the nested-loops join
algorithm to its page of R and the incoming page of S. If P < N
processors are available, the algorithm is executed in N/P
stages.
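A sequential simulation of this broadcast scheme is sketched below; the page contents and join attributes are illustrative, and in hardware the loop over R pages would run concurrently on all processors.

    # Illustrative simulation of the broadcast nested-loops join: each "processor"
    # holds one page of R, and every page of S is broadcast to all of them in turn.
    def broadcast_join(R_pages, S_pages, r_attr, s_attr):
        result = []
        for s_page in S_pages:                   # one broadcast per page of S
            for r_page in R_pages:               # done concurrently by every processor in hardware
                for rt in r_page:
                    for st in s_page:
                        if rt[r_attr] == st[s_attr]:
                            result.append({**rt, **st})
        return result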
Figure 8-An architecture based on broadcasting (processing elements attached to a common broadcast bus)

Furthermore, as shown in Figure 8, the architecture required to support these algorithms is straightforward. All that
is required is a bus that is capable of broadcasting at the same data rate as the mass-storage device. The main disadvantage
of this approach is that the first step of the algorithm in which
each processor gets a page of R is sequential. It can, however,
be parallelized if R resides on a parallel-readout disk drive.
CONCLUSIONS AND FUTURE DIRECTIONS
From the discussions contained in this paper we are able to
draw a number of conclusions about parallel architectures for
non-numeric applications. First, manufacturers must place
more emphasis on the development of mass-storage devices
that have higher transfer rates if parallel architectures for
non-numeric applications are to become viable. Second, while
we have suggested two different styles of parallel architectures
for executing searches and complex operations, database management systems perform both types of operations. Thus, it
appears worthwhile to investigate the design of a machine
that combines the ability of the "on the disk" machines to
process selection queries rapidly with those features that facilitate execution of complex operations on formatted databases. The RDBM28 represents a first attempt at designing
such a machine.
ACKNOWLEDGMENTS
We would like to acknowledge support of this research by
the Department of Energy under contract #DE-AC02-81ER10920 and the National Science Foundation under grant
MCS78-01721.
REFERENCES
1. Slotnick, D. L. "Logic per Track Devices." In Franz Alt (ed.), Advances in Computers (Vol. 10), New York: Academic Press, 1970, pp. 291-296.
2. Boral, H., D. J. DeWitt, and W. K. Wilkinson. "Performance Evaluation of Four Associative Disk Designs." Journal of Information Systems, Vol. 7, No. 1, January 1982.
3. DeWitt, D. J., and P. Hawthorn. "Performance Evaluation of Database Machine Architectures." Invited paper, Proceedings of the 7th International Conference on Very Large Databases, September 1981.
4. Friedland, Dina B. "Design, Analysis, and Implementation of Parallel External Sorting Algorithms." Computer Sciences Technical Report #464, University of Wisconsin, January 1982.
5. Hawthorn, P., and D. J. DeWitt. "Performance Evaluation of Database Machines." IEEE Transactions on Software Engineering, January 1982.
6. Date, C. J. An Introduction to Database Systems. Reading, Mass.: Addison-Wesley, 1981.
7. Codd, E. F. "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, Vol. 13, No. 6, June 1970.
8. Knuth, D. E. The Art of Computer Programming-Sorting and Searching. Reading, Mass.: Addison-Wesley, 1975, p. 160.
9. Bancilhon, F., and M. Scholl. "Design of a Backend Processor for a Data Base Machine." Proc. of the ACM SIGMOD 1980 International Conference on Management of Data, May 1980.
10. Haskin, R. L. "Hardware for Searching Very Large Text Databases." Proceedings of the Fifth Workshop on Computer Architecture for Non-Numeric Processing, March 1980, pp. 49-56.
11. Parker, J. L. "A Logic per Track Retrieval System." IFIP Congress, 1971.
12. Minsky, N. "Rotating Storage Devices as Partially Associative Memories." Proc. 1972 FJCC.
13. Parhami, B. "A Highly Parallel Computing System for Information Retrieval." Proc. 1972 FJCC.
14. Ozkarahan, E. A., S. A. Schuster, and K. C. Smith. "RAP-Associative Processor for Database Management." AFIPS Conference Proceedings (Vol. 44), 1975, pp. 379-388.
15. Sadowski, P. J., and S. A. Schuster. "Exploiting Parallelism in a Relational Associative Processor." Fourth Workshop on Computer Architecture for Non-numeric Processing, Aug. 1978.
16. Su, Stanley Y. W., and G. Jack Lipovski. "CASSM: A Cellular System for Very Large Data Bases." Proceedings of the VLDB, 1975, pp. 456-472.
17. Bryson, D., D. Clover, and D. Lee. "Megabit Bubble-Memory Chip Gets Support From LSI Family." Electronics, April 26, 1979.
18. Kannan, Krishnamurthi. "The Design of a Mass Memory for a Database Computer." Proc. Fifth Annual Symposium on Computer Architecture, Palo Alto, CA, April 1978.
19. Leilich, H. O., G. Stiege, and H. Ch. Zeidler. "A Search Processor for Data Base Management Systems." Proc. 4th Conference on Very Large Databases, 1978.
20. Wong, E., and K. Youssefi. "Decomposition-A Strategy for Query Processing." ACM Transactions on Database Systems, Vol. 1, No. 3, September 1976, pp. 223-241.
21. Boral, H., D. J. DeWitt, D. Friedland, and W. K. Wilkinson. "Parallel Algorithms for the Execution of Relational Database Operations." Submitted to ACM Transactions on Database Systems, October 1980.
22. Astrahan, M. M., et al. "System R: Relational Approach to Database Management." ACM Transactions on Database Systems, Vol. 1, No. 2, June 1976, pp. 97-137.
23. Blasgen, M. W., and K. P. Eswaran. "Storage and Access in Relational Data Bases." IBM Systems Journal, Vol. 16, No. 4, 1977.
24. Babb, E. "Implementing a Relational Database by Means of Specialized Hardware." ACM Transactions on Database Systems, Vol. 4, No. 1, March 1979, pp. 1-29.
25. Goodman, J. R., and A. M. Despain. "A Study of the Interconnection of Multiple Processors in a Data Base Environment." Proceedings 1980 International Conference on Parallel Processing, August 1980, pp. 269-278.
26. Even, S. "Parallelism in Tape Sorting." CACM, Vol. 17, No. 4, April 1974.
27. Batcher, K. E. "Sorting Networks and Their Applications." Proceedings Spring Joint Computer Conference (Vol. 32), 1968, pp. 307-314.
28. Hell, W. "RDBM-A Relational Data Base Machine: Architecture and Hardware Design." Proceedings of the 6th Workshop on Computer Architecture for Non-Numeric Processing, June 1981.