Database machines and some issues on DBMS standards*

by STANLEY Y. W. SU, University of Florida; HSU CHANG, IBM Yorktown Heights; GEORGE COPELAND, Tektronix; PAUL FISHER, Kansas State University; EUGENE LOWENTHAL, MRI Systems; and STEWART SCHUSTER, TANDEM

* This work is supported by the National Bureau of Standards under contract #NB 79NAA B4369-1.

From the collection of the Computer History Museum (www.computerhistory.org)

National Computer Conference, 1980

INTRODUCTION

There are several co-related activities in the database area and computer architecture that make the discussion of database machines and their implications on DBMS standards timely and meaningful. First, in the database area there is a drive toward more powerful database management systems which support high-level data models and languages. The motive for this drive is the requirement to greatly improve user/programmer productivity and to protect applications from changes in the user environment. However, supporting these interfaces by software means often introduces inefficiency in database management systems because of the many levels of complex software which are required to map the high-level data representations and languages to the low-level storage representations and machine codes.

Second, the need for systems which handle very large databases is increasing rapidly. Very large databases complicate the problems of retrieval, update, data recovery, transaction processing, integrity, and security. Software solutions to these problems work well for both small databases supporting many applications and large databases supporting only a few applications. However, the labor-intensive cost, time delays, and reliability problems associated with software development and maintenance will soon become prohibitive as large and highly shared databases emerge. The search for hardware solutions to these problems is a necessary and viable alternative for balancing functionality and price/performance.

Third, the progress made in hardware technology in the past decade is phenomenal. The cost of memories, processors, terminals, and communication devices has dropped and will continue to drop at a drastic rate. It is time for a reevaluation of the traditional role of hardware and software in solving problems of today and tomorrow in database management.

Fourth, there is a vigorous drive toward DBMS standards led by NBS (26,27) aiming to "1) protect the federal investment in existing data, programs, and personnel skills, 2) improve the productivity and effectiveness of database systems available to federal agencies, 3) assist federal agencies with guidelines on the selection, procurement, use, and availability of database systems, 4) perform the research necessary to identify future federal needs and to foster the development of necessary database tools."

Research on database machines is relevant to the study of DBMS standards in the following ways. First, when a standard is to be proposed for adoption, it is important to consider how easily the standard can be implemented and the cost involved in its implementation. Database machines may drastically change the ways database management functions are implemented, and new technologies may alter the picture of cost involved in database management. A standard is not practical unless it can be implemented with efficiency and reliability. Database machines hold promise to provide more efficient and reliable ways to implement the database functions. Second, very often several alternative designs (e.g., data models or data languages) exist and can be the candidates for standards. Good evaluation and proper selection of these alternatives based on criteria such as "user/programmer productivity," "ease of use," "natural to the user and DBA," etc., are extremely difficult to obtain. In this situation, the selection of one of the alternatives as the standard can be based on, among other variables, how well the selected standard is supported by the present database machines and can be supported by the future machines. Third, the change of hardware architecture of a computing machine will have great effect on the design and implementation of a database management system. In particular, new hardware may change the interfaces among the components of a DBMS. Thus, the study of the standards for DBMS interfaces should take into consideration the present and expected progress in database machine research and development.

This paper reports on the results of a study conducted under the support of the National Bureau of Standards (contract #NB 79NAA B4369-1) to examine some of the proposed DBMS standards from the point of view of database machines. The emphasis is on the discussion of several issues related to data models and data languages and on how well they can be supported by database machines. The study aims to 1) assess the progress made in the database machine area, 2) determine the functional capabilities and limitations of the present database machines, 3) examine the issues on DBMS architecture, data models, and data languages from the point of view of present and future database machines, and 4) address some technical issues on the technology, the hardware, and software architectures of database machines.

II. SOME LIMITATIONS OF CONVENTIONAL COMPUTERS FOR DATABASE APPLICATIONS

Several limitations found in the conventional computers motivate the study of database machines. They are:

A. Mismatch of conventional computers for database applications

In 1948, von Neumann designed the programmable electronic digital computer for numeric applications. The design matched the technology of that day very closely to numeric applications.
The semantic definition of numeric data was matched very closely to the storage representation:

    data semantics          random access storage
    x, 26                   location 1, 26
    y, -5                   location 2, -5
    z, 1.7 x 10^23          location 3, 1.7 x 10^23

Using a random access storage, only a simple and efficient one-to-one mapping was necessary. Also, the semantics of numeric operations were matched very closely to the hardware instructions:

    semantics of operations     hardware instructions
    add                         ADD
    subtract                    SUB
    store in memory             STR

This close match allowed a very simple and efficient mapping. However, two significant things have happened since that time. First, hardware technology has changed drastically. Cost and speed per function have improved by many orders of magnitude in the last 30 years. The rules of cost-effective packaging have changed from minimization of the number of logic gates and memory bits to minimization of the number of IC pins and packages. Secondly, the primary application for digital computers is shifting from numeric to non-numeric applications. In non-numeric applications, the user retrieves and manipulates data by specifying the attributes and values of the data he is interested in, i.e., addressing data by contents, rather than by addressing the memory locations where the data of interest are stored. The basic operations required are Search, Retrieve, Update, Insert, Delete, Move-data, etc., rather than Add, Subtract, Shift, etc. The mismatch of the von Neumann design to non-numeric applications is the main cause of the complexity and inefficiencies of the present systems. The large and growing market for database systems warrants a reevaluation of the relationship between technology and database applications. If technology can be matched more closely to database applications, then perhaps advanced functionality, ease-of-use, and data independence can be achieved cost-effectively.
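The contrast between addressing by location and addressing by contents can be sketched in a few lines of code (the record set and attribute names here are hypothetical, chosen only for illustration):

```python
# Location addressing (von Neumann style): the program must know
# *where* a datum is stored before it can fetch it.
memory = [26, -5, 1.7e23]           # locations 0, 1, 2
x = memory[0]                       # fetch by address

# Content addressing (database style): the program specifies *what*
# it wants via attributes and values; the system finds the locations.
records = [
    {"name": "Smith", "dept": "sales", "salary": 18000},
    {"name": "Jones", "dept": "sales", "salary": 21000},
    {"name": "Brown", "dept": "audit", "salary": 19000},
]

def search(records, attribute, value):
    """Return every record whose named attribute equals the value."""
    return [r for r in records if r[attribute] == value]

sales_staff = search(records, "dept", "sales")   # no addresses involved
```

In software, the content-addressed form must ultimately be mapped onto location addressing, which is exactly the many-level translation this section identifies as the source of complexity; the database machines discussed below aim to support the second style directly in hardware.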
In all considerations, however, the most serious drawback is the lack of appropriateness of the sequential machine for the parallel process of data manipulation. One can liken this to the analogy of viewing a three-dimensional cube on a two-dimensional surface. Some forms are still recognizable; however, many others are skewed and hence do not appear 'normal.' So it is with processors. The software problems become necessarily more complex, simply because the representation is not appropriate. By providing a more appropriate environment, perhaps the 'skewedness' of present problems can be reduced.

B. Many levels of mapping

Recent research efforts show that high-level data models and data languages which exhibit a high degree of data independence and ease of use are required both to improve human productivity and to act as logical interfaces with database systems. Currently, the implementation of high-level data languages and data models requires many levels of complex software to be executed, causing inefficiencies in system utilization and response. The software complexity and system inefficiency are due to the requirement that high-level commands and data views be translated into the low-level machine codes and structures. In particular, software implementation of high-level data representations requires that auxiliary data structures such as inverted files, directories, pointers, etc., be introduced to speed up data accesses for a particular set of applications. These auxiliary data structures must be properly maintained. This requirement complicates the updating operation, one of the most important database management functions, and significantly decreases its efficiency.
Also, since these auxiliary data structures are tailored for a particular set of applications, a change in application often requires a large, labor-intensive software maintenance project. This considerably increases cost and time delays and decreases reliability.

C. Performance bottlenecks

There seem to be two major performance bottlenecks in the present systems: the staging bottleneck and the communication bottleneck. In conventional systems, data are not stored at the place where they are processed. To "stage" data into main memory for processing is very time consuming, and often ties up the important resources of a computing system, such as communication channels. Database applications will continue to demand larger and more complicated databases, requiring more time to stage and process the data files. In order to support very large databases (greater than 10^10 bytes), or databases requiring fast update and/or complex query, it is necessary to exploit specialized hardware to eliminate unnecessary data staging and to carry out database management functions efficiently. Data communication over long distances is expensive and limited in speed. This forces many database systems to physically distribute data to locations where usage is highest. Data redundancy is often purposefully introduced in distributed systems to avoid excess amounts of data transfer and to improve performance and reliability. However, many additional problems of data updating, recovery, integrity, and security in distributed systems are introduced by the above techniques. Special-purpose hardware tailored toward managing distributed databases and supporting data communication would be very useful.

D. Users' increasing demands for DBMS capabilities

Database management system users are continuously demanding more sophisticated DBMS capabilities.
Capabilities such as automatic database restructuring and system tuning, automatic data distribution and redistribution, backup and recovery, integrity and security controls, etc., are generally handled by software in the traditional systems. Tremendous overhead is generated in implementing these capabilities. Because systems are currently pushing software complexity barriers, performance improvements in this area are not likely without dedicating hardware to unburden saturated systems.

III. THE OBJECTIVES AND CHARACTERISTICS OF THE EXISTING DATABASE MACHINES

A database machine (DBM) can be defined as any hardware, software, and firmware complex dedicated and tailored to perform some or all of the functions of the database management portion of a computing system. The DBM may range from a small, personal query machine (intelligent terminal) to a large, public-utility information machine. We shall categorize the existing database machines into four categories based on their architectural distinctions and their differences in objectives and characteristics. Each category of machine attempts to remove some or all of the limitations discussed in the preceding sections. In the following presentation, only the recent systems designed for general-purpose database management applications are covered. Systems designed for text processing, document retrieval, sorting, etc., which are database machines in their own right, are not included.

Category 1: cellular-logic systems

A cellular-logic system consists of a linear array of cells, each of which contains a processor and a memory element [47]. The general architecture of cellular-logic systems is illustrated in Figure 1. A database operation such as Search, Retrieve, Update, Delete, or Insert is broadcast simultaneously to all the processors, which carry out the operation against the data residing in their associated memory elements.
Thus, in one rotation of the memory, the entire database is searched in 1/n of the time needed for a sequential search over n segments of data. Efficiency in data searches and other database operations is gained by the parallel processing elements. The memory elements of these devices can be disk tracks, bubble memories, CCD's, RAM's, or other types of memories. The cells in these devices may communicate with their adjacent neighbors. This category of devices, thus, refers to a more general class of machines than the logic-per-track concept introduced by Slotnick [46]. The basic idea of cellular-logic systems is to move some of the frequent database management functions to intelligent secondary storage devices so that these functions can be carried out by the storage devices without the attention of the main processor. The data stored on rotating devices such as disks, drums, CCD's, or magnetic bubble memories are systematically and exhaustively searched by the processing elements, one for each physical or electronic track of the rotating memory. Thus, data are processed on the same device where they are stored. Irrelevant data can be filtered out by the secondary storage devices and only the relevant data are brought into the main memory for further processing, thus avoiding the problem of staging described in the preceding section. Furthermore, since the entire database is exhaustively searched in each circulation of the memory, data can be searched either associatively by contents (i.e., by specifying what data are to be searched for rather than where the data can be found) or by contexts (i.e., by specifying the neighborhood where relevant data can be found). The content and context search techniques in the cellular-logic devices offer uniformity and fast response time for search and update operations without the need to build and maintain special supportive structures such as indexes, hash tables, pointers, etc., used in the conventional systems.
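The effect of the cellular organization can be sketched in software. The following is a simulation only (real cells scan their tracks concurrently in hardware; here the per-cell scans run one after another and the parallel time is simply the length of the longest per-cell scan):

```python
# Each cell holds one segment of the database. A query is broadcast to
# all cells; each cell scans only its own segment, so one "rotation"
# examines the whole database in the time one processor needs for a
# single segment -- 1/n of a full sequential search over n segments.
def make_cells(records, n_cells):
    """Distribute records round-robin across n memory elements."""
    cells = [[] for _ in range(n_cells)]
    for i, r in enumerate(records):
        cells[i % n_cells].append(r)
    return cells

def broadcast_search(cells, predicate):
    """Apply the same predicate in every cell; collect the matches.
    longest_scan models the elapsed parallel time per rotation."""
    hits, longest_scan = [], 0
    for segment in cells:
        longest_scan = max(longest_scan, len(segment))
        hits.extend(r for r in segment if predicate(r))
    return hits, longest_scan

records = list(range(1000))
cells = make_cells(records, n_cells=10)
hits, scan_time = broadcast_search(cells, lambda r: r % 250 == 0)
# scan_time is 1000/10 = 100 steps, regardless of which cells hold the hits
```

Note that only the qualifying records (`hits`) cross back toward the host, which is the filtering-at-the-storage-device behavior described above.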
Data can be stored in these machines in a form very similar to the data structure defined in the conceptual schema of a database. Thus, the difference between the conceptual schema and the internal schema of a database in these machines is not as distinct as in conventional systems. The complex mapping between the two data representations can often be avoided.

Figure 1. Cellular-logic configuration (a controller and an array of processors, each attached to its own circular memory).

Four basic architectural decisions which lead to an improved packaging of the technologies for the exhaustive associative search are examined as follows:

(A) The hardware consists of a regular arrangement of identical cells. The argument for this decision is as follows. First, the development and manufacturing costs of LSI and circuit boards are minimized, since only a single generic chip need be developed and manufactured and arranged uniformly on circuit boards. Second, reliability is improved because of the overall simplicity of this approach and because several simple schemes can be used to provide dynamic recovery from hardware failures. Third, the system can easily be expanded modularly without causing disruption to the system organization. As the database grows, increased storage is accompanied by increased processing power, so response time remains independent of database size.

(B) Instead of using higher-order arrays or tree structures, a one-dimensional array of cells is used. The reasons are as follows. First, a one-dimensional array minimizes the number of LSI pins per cell, since communication is restricted to fewer cells. Second, the number of pins per package is independent of the number of cells per package. This is very important, since it allows us to directly exploit the drastic, yet consistent, improvement in density without increasing the number of pins per package.
No other arrangement can accomplish this. Improved lithography and circuit designs promise further improvements by a factor of 100 in area density by 1990. Third, hardware utilization is most easily achieved using a one-dimensional array, since only one constraint must be met; a two-dimensional array, for example, imposes two. Users of the ILLIAC IV (Kuck [29]) have found this to be very awkward. Furthermore, variable-length data objects can easily be linearized onto a one-dimensional array.

(C) Each cell has a dedicated processor and memory. The reasons for taking this approach are as follows. First, experience has shown that using N processors that can access M memories leads to severe interconnection contention, so that neither processors nor memories are well utilized. A fixed one-to-one relationship between processors and memories allows an efficient utilization of both. Second, it also removes the complex reliability and packaging problems involved in a large interconnection switch. Third, the parallelism inherent in the exhaustive search can be directly exploited. Fourth, the amount of memory per processor can be varied to allow a family of database machines to be built using the same architecture. This allows trade-offs between cost and response time to be matched to different user environments and changing technology.

(D) Block-organized memories that are serially accessible within each block can be used, such as charge-coupled devices (CCD's), magnetic bubbles, and disks. These memories are generally cheaper per bit than memories that allow random addressing at the character level. Such memories are generally classified as slow access. However, they are slow only when used to emulate a random access memory. When used for the exhaustive associative search, they are as efficient as a truly random access memory. In addition to searching efficiency, these devices offer efficient storage management for updates.
Because of their dynamic nature, data can be inserted in place at the maximum data rate of the memory. Also, supportive data structures such as indexes, pointers, hash tables, etc., are eliminated and the effective cost per bit is further reduced. In summary, the block-serial nature of these devices can be fully exploited to improve simplicity, efficiency, and data independence.

Several systems have been designed based on the cellular-logic approach and some have gone through prototype implementation. A few of these systems are briefly mentioned here. More details can be found in the papers included in the special issues on database machines [9,24].

The CASSM project began at the University of Florida in 1972. The aim was to investigate the hardware and software characteristics of various associative techniques. Direct hardware support of relations, hierarchies, networks, and string processing was investigated [11,32,48]. These hardware data types were implemented without any restrictions on length. Also, the storage and retrieval of instructions directly from the associative memory (associative programming) was studied. Associative programming is presently being studied at the University of Florida [50] under a continued NSF grant, with the CASSM architecture simulated in software [49].

The RAP project began at the University of Toronto in 1974. RAP [41,44] was intended to provide direct hardware support for the normalized relational model, with restrictions on the length of tuples. The RAP project also contributed to the understanding of several system-level considerations, such as the use of RAP as a staging device for very large file systems, and system throughput under a multi-user environment. Since its initial design, RAP has gone through some substantial changes. The most recent version, which is described in Schuster et al.
[45], reduces the restrictions on the length of tuples.

The RARES project began at the University of Utah in 1976. RARES [30] provided hardware support for normalized relations with length restrictions. The RARES storage structure was chosen to optimize output efficiency.

A research project called INDY began at Tektronix in 1977. INDY [10] directly implements a kernel language that is based on strings and classical sets with no hardware restrictions on length or cardinality. This kernel language acts as a meta-language that is generalized enough to directly describe various data languages and views, providing a simple closed mathematics for facilitating translations between views.

A recent project undertaken by Chang [7] at IBM, Yorktown Heights, investigates the use of magnetic bubble memories for supporting relational databases. A modular, configurable, electronically-timed magnetic bubble storage has been studied. The system follows the general concept of logic-per-track, while a track in this case is a magnetic bubble chip with a modified major-minor loop organization. The proposed bubble chip configuration is shown in Figure 2. The storage minor loops are grouped to correspond to domains in a relation. The transfer line is segmented to allow a minor-loop group (i.e., a domain) to be accessed individually. The short buffer loops between the major and minor loops alleviate the problems arising from the rigid synchronization of the major and minor loops. The off-chip marker loops, being one bit wide in contrast to the many-bit data records they mark, can be quickly scanned to identify previously marked tuples. Since the minor loops allow parallel advance of data while the major loops only permit serial read-out of data, the quick-scan feature of the marker loop can eliminate the output of unqualified data, thus greatly enhancing performance. The project clearly demonstrates that bubble memories have several desirable characteristics which can be utilized advantageously to support database management.

Figure 2. Modified major-minor loop organization (read and write major loops with buffers and transfer lines, N groups of minor loops, off-chip marker loops, detector, and annihilator).

In summary, the distinguishing features of the cellular-logic approach are 1) increased processing capabilities in secondary storage devices, reducing the need for data staging into the main memory, 2) search time that is independent of the database size, 3) elimination of the need for building, updating, and protecting auxiliary structures, 4) the use of identical cells, which increases reliability, gives flexibility in adding or removing cells, and reduces the cost of production, and 5) the potential for extremely high speeds as cell sizes decrease and memory density and speed increase (i.e., an increase in the ratio of processing power to memory). Although most of the systems described here have gone through prototype implementation and testing, performance data from a real application environment is still lacking. The existing prototypes have rather limited processing capabilities. Many of the DBMS functions still have to be handled by a conventional computer. Also, the staging problem described in Section II will not be totally eliminated if large databases are stored on archival memories and have to be moved to cellular-logic devices.
Category 2: backend computers

Backend computers in database systems are dedicated computers for carrying out database processing functions such as the retrieval and manipulation of databases, the verification of data access, the formulation of responses, the enforcement of integrity and security rules and constraints, etc. Backends are usually general-purpose computers, even though special-purpose machines can very well be used. Figure 3 shows one possible configuration: the operating system, application programs, and DBMS interface run on the host computer, and the actual DBMS runs on the backend computer.

Figure 3. A configuration of a backend computer system (application programs, DBMS interface, and operating system on the host; the DBMS with schema, subschemas, and DML tasks on the backend; storage holding the database).

The key concept of backends is to off-load the database management functions from the host computer to dedicated processor(s) in order to 1) release the host from tedious and time-consuming operations involved in database manipulation, maintenance, and control, and 2) increase system performance through functional specialization of, and parallel processing among, the host and the backend(s). The primary impetus for the backend approach is, of course, to reduce the cost of managing data. The backend approach can be viewed as a cost-effective alternative to upgrading the host, or as a way to achieve a level of functionality and performance that no conventional system can provide. The isolation of the DBMS, the mass storage devices, and the database from the host can bring a number of additional advantages. First, several hosts, possibly dissimilar, can share on-line data in the configuration shown in Figure 4. A single backend may handle the processing of the database and present data in forms suitable to the dissimilar hosts. Second, databases and the DBMS itself can be transported from an old mainframe to a new one with relatively little conversion effort.
Similarly, changes to the databases, the mass storage devices, and the DBMS (e.g., adopting a standard DBMS) can be made without entailing changes to the host. Third, storage devices, including special-purpose cellular-logic devices or bubble devices, can be made available through backends to mainframes that do not otherwise support these devices because of I/O or operating system constraints. Fourth, multiple backends (see Figure 5) can be used to process large databases, which can be stored either in a distributed manner across secondary memory devices to facilitate parallel processing, or in a manner such that one database can be processed by one backend. Lastly, the enforcement of database integrity and security can be separated from that of operating system integrity and security; thus the failure of one will not endanger the other.

Figure 4. Multiple host configuration (several hosts sharing a single backend and its database).

Figure 5. Multiple backend configuration (backends 1 through K, each with its own database).

The first development of the backend system occurred at Bell Laboratories [5]. This system was called the Experimental Data Management System (XDMS) and was undertaken both to demonstrate the capability of the backend concept and to implement the new CODASYL DBMS specifications. The implementation required eighteen months and six man-years of effort. The system was implemented to a level of experimental usefulness and the concept was verified.

The Datacomputer is another example of the backend processor approach. It is a large-scale database management system running on a PDP-10 and has been implemented for use in ARPANET [36] by Computer Corp. of America. The Datacomputer essentially provides facilities for data sharing of a single database among dissimilar host computers in a network environment.
That is, it is implemented through a communication scheme involving the identification of the host processor type, so that data retrieved and sent by the Datacomputer can appear in the format expected by the requesting host. Likewise, data to be stored by the Datacomputer are converted upon receipt from the identified host and stored for use as the originator sees it. With such a scheme, the amount of storage can be continually expanded, performance can be maintained by replicating the systems, and the backend machines are available to all hosts in the network.

Some additional developments indicate the possible direction in which this movement may be heading. In the past few months, Cullinane Corporation made available to four government agencies IDMS implemented on a PDP 11/70 capable of supporting an IBM or IBM-compatible host. One participating group (within the Navy) is just now beginning a very serious evaluation of the utility of such a system in their production environments to extend the useful life of their existing computing facilities. During the period of time Cullinane Corporation was implementing IDMS for use in a backend, Kansas State University [16,17,37], under a grant from the U.S. Army Computer Systems Command, was developing a prototype network system built around a machine-independent, high-speed bus system (20 megabytes/sec transfer rate) which would permit heterogeneous computers to communicate in any topology desired. With this communications support software finished, a natural application was the backend environment. The software design documents were furnished to Cullinane along with the host software. Additionally, Cincom's DBMS system (TOTAL) was modified to run on an Interdata 8/32 backend from either the IBM host or another mini in the network acting as a host.
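The host/backend division of labor described in this section can be sketched as a toy model. The request format and class names below are hypothetical stand-ins for a real DML interface such as the CODASYL DML, not the interface of any system named above:

```python
# Toy model of the backend concept: the host formulates high-level
# requests; the backend owns the storage, runs the DBMS, and returns
# only qualifying records across the host/backend interface.
class Backend:
    """Dedicated database processor: stores the data and executes DML."""
    def __init__(self, database):
        self.database = database            # backend-resident storage

    def execute(self, request):
        # request is an (operation, attribute, value) triple
        op, attribute, value = request
        if op == "RETRIEVE":
            return [r for r in self.database if r.get(attribute) == value]
        raise NotImplementedError(op)

class Host:
    """Application machine: never touches the storage directly."""
    def __init__(self, backend):
        self.backend = backend

    def query(self, attribute, value):
        return self.backend.execute(("RETRIEVE", attribute, value))

backend = Backend([{"part": "bolt", "qty": 40}, {"part": "nut", "qty": 15}])
host = Host(backend)
bolts = host.query("part", "bolt")   # only matching records cross the interface
```

Because the host sees only the request/response interface, the storage devices and the DBMS behind `Backend` can change (or be shared among several hosts) without host-side changes, which is the portability and sharing argument made above.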
A great deal of database machine activity is occurring in Japan. One project defines a database machine called ODS, a generalized database subsystem, which has a sufficiently low-level interface to provide potential support for any data model [18]. One major contribution is its ability to interface directly to the main memory of its host, so that the I/O overhead incurred by the host CPU during large data transfers can be avoided.

The existing backend systems are still experimental in nature. The desirability of backends is yet to be proven by performance evaluation and measurement of "real" systems. In conclusion, the idea of extending the functionality and performance of a mainframe by dedicated backends is a sound one. However, this approach does have its problems. For example, the backend(s) introduces different hardware with the attendant problems of maintenance, software support, and additional procurement effort and cost. Also, the balanced assignment of DBMS tasks to the host and the backend(s) is not a simple problem. More discussion of backends can be found in [33,42].

Category 3: integrated database machines

This category of systems uses a number of functionally specialized processors, which can be general-purpose and/or special-purpose processors, to implement the processes of a DBMS. Systems of this type may use, for example, specialized associative processors for the processing of directories and mapping data, intelligently controlled disks and mass storage devices for the storage and processing of the major portions of the database, a system processor for general coordination, and dedicated hardware for security control. By the use of functionally specialized hardware and the parallel processing capabilities of a family of machines, these systems aim to achieve greater efficiencies in database management. The highly modular family of machines gives users the opportunity to mix and match processing and storage capacity.
Unlike the cellular-logic systems in category 1, the systems in this category are larger and more complete; a category 1 system can be a component of such a system. The specialized hardware units used in these systems are quite different from one another; they lack the uniformity of the cells in category 1 systems. This category also differs from category 2 systems in that functionality and performance are achieved mainly by hardware (and, with it, software) specialization rather than by the software specialization alone used in the existing backends. It should be noted, however, that this distinction would blur if special-purpose hardware devices were used in the backend systems. Nevertheless, we can say that the design of this category of systems involves treating hardware, software, DBMS, and databases as a whole rather than simply extending the capability of a given mainframe using backends. Some example systems of this category are the following. The Data Base Computer (DBC) project at Ohio State University proposes an architecture in which every major DBMS function has a dedicated processor and whose overall organization exploits pipeline parallelism [1,3,20,21,22,23]. It contains various associative processors for logical data model and disk memory mapping. It also proposes several architectural changes to moving-head disks to increase bandwidth an order of magnitude over today's secondary storage data rates. The integration of the security function into the DBC's architecture is also considered. The RAP.2 effort at the University of Toronto has expanded its research by formulating the role of the RAP associative processor (a category 1 machine by itself) in an integrated database machine. Most of the work has centered around data partitioning or staging strategies, where database and schema data reside partially on disk and partially on associative processors [45]. The INFOPLEX system proposed at MIT is an example of an integrated database machine architecture [35].
It utilizes new microprocessor capabilities by organizing a memory and processor hierarchy which takes advantage of the parallelism inherent in concurrent requests to maximize performance. Another direction is to make use of low-cost, currently available microprocessors to form a simple network system for processing distributed databases using a single-instruction multiple-data stream (SIMD) architecture. In this case, segments of data files are stored across memory devices, each of which is dedicated to a microprocessor. Software tasks for a database management system are simultaneously carried out by the processors against the contents of their local memories. This alleviates much of the switching-time overhead found in network systems with shared memory. A recent example of this approach is the MICRONET system being developed at the University of Florida [51] using a PDP 11/60 and four LSI-11 computers. Another multiprocessor system, called DIRECT [15], is designed to support relational database management systems using a multiple-instruction multiple-data stream (MIMD) architecture. Microprocessors are dynamically assigned to a query depending on its priority, the type of relational algebra operators it contains, and the size of the relations referenced. The system is being implemented using LSI-11/03 microprocessors and CCD memories which have associative search capability. In summary, the main characteristics of this category of database machines are 1) the use of functionally specialized hardware to achieve efficiency, 2) a systems approach to the design of hardware, software, DBMS, and databases, and 3) a modular family of machines that allows users to exploit parallel processing and pipelining techniques. However, the hardware interconnection, the data and program communication, and the operating system support in a system using dissimilar hardware can be rather complex.
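The SIMD arrangement just described can be sketched in miniature, with threads standing in for the dedicated microprocessors and Python lists for their local memory segments. All names and the record layout are hypothetical; this is the broadcast idea, not MICRONET's actual implementation:

```python
# Sketch of SIMD-style database search: one selection "instruction" is
# broadcast to every node, and each node scans only its own local segment.

from concurrent.futures import ThreadPoolExecutor

def make_partitions(records, n_nodes):
    """Deal records round-robin across n_nodes local memories."""
    parts = [[] for _ in range(n_nodes)]
    for i, rec in enumerate(records):
        parts[i % n_nodes].append(rec)
    return parts

def node_select(segment, predicate):
    """The instruction every node executes against its local segment."""
    return [rec for rec in segment if predicate(rec)]

def broadcast_select(partitions, predicate):
    """Broadcast one predicate to all nodes; union the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = pool.map(node_select, partitions,
                           [predicate] * len(partitions))
    return [rec for part in results for rec in part]

accounts = [{"id": i, "balance": i * 100} for i in range(8)]
parts = make_partitions(accounts, 4)
rich = broadcast_select(parts, lambda r: r["balance"] >= 500)
```

Because every node holds a disjoint segment, the scan time is divided by the number of nodes without any shared-memory switching.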
The proper identification of DBMS functions for implementation in hardware remains a challenge.

Category 4: high-speed associative memory systems

In this category of machines, a high-speed associative memory is used together with conventional memory devices such as core memories, rotating memories, or shift registers to form a hierarchy of memories for data processing. Databases are stored on conventional secondary storage devices. Data are moved from the slower secondary storage to the associative memory for high-speed searches by content or context. The same characteristics that make a cache effective in speeding up references to main memory are used here to improve data access to secondary storage. Figure 6 shows a typical configuration of this type of system. The associative memories used in these systems differ from the cellular-logic systems in that each bit or each word, rather than a segment of memory, has a processing element. Associative searches can be carried out in all bits or words of the memory simultaneously and thus are much faster than the sequential scan of memory segments in rotational devices. The technology used for high-speed associative memories is faster than rotating devices; however, it is far more costly. A good example of the high-speed associative memory approach is the STARAN computer system [2,12,43]. The key element of the system is a set of up to 32 associative processor arrays which provide content addressing and parallel processing capabilities. Each processor array is a multi-dimensional access memory matrix containing 256 words by 256 bits, with parallel access to a maximum of 256 bits at a time. The access can be in either the word or the bit direction. Associated with each word of a processor array is a processing element which examines the content of the word and manipulates the word bit-by-bit serially.
Control signals are broadcast to the processing elements in parallel by the control logic unit, and the processing elements execute instructions simultaneously. Data stored in the main or secondary storage of a conventional computer system are paged in and out of the processor arrays for associative searches. Program instructions of the associative processor are stored in a control memory which consists of three fast page memories made of volatile, bipolar, semiconductor elements and a core memory block. Program segments stored in the core memory block are paged into the fast memories before execution. The control logic unit fetches and interprets instructions from the control memory and transfers control signals to the processing elements of the processor arrays to manipulate data in the arrays. Although the associative array processor was originally built for air traffic control and other real-time sensor surveillance and control applications, the content addressability and parallel processing capabilities of the processor provide many desirable features for database management. A DBMS built around a four-array STARAN has been reported by Moulder [40]. Other work based on this system and a hypothetical associative memory for use in a database management environment can be seen in DeFiore and Berra [13,14], Berra and Oliver [4], and Linde et al. [31].

Figure 6-A typical associative memory system configuration.

The principal benefit of this approach is improved performance. The use of high-speed associative memory reduces the effective access time of the mass memory where databases are stored. However, due to the high cost of building this type of memory and processor, the size of the associative memory is rather small. In a database management environment, considerable amounts of data will have to be paged in and out of the associative memory to take advantage of its capability.
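The word-parallel, bit-serial search performed by the processing elements can be sketched as follows. This toy version uses Python integers for words and a list of match flags for the response store; it is a simplification of the idea, not STARAN's actual instruction set:

```python
# Word-parallel, bit-serial associative search: every word is compared
# against the key "simultaneously", one bit position per step, so the
# number of steps depends on the word width, not on how many words exist.

def associative_equal_search(words, key, width=8):
    """Return the indices of all words equal to key."""
    match = [True] * len(words)       # response store: one flag per word
    for bit in range(width):          # width steps regardless of word count
        k = (key >> bit) & 1          # broadcast one bit of the key
        for i, w in enumerate(words):                 # conceptually parallel
            if match[i] and ((w >> bit) & 1) != k:
                match[i] = False      # this word's element drops out
    return [i for i, m in enumerate(match) if m]
```

The inner loop over words is what the hardware does in parallel; only the outer loop over bit positions is inherently serial.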
Although data can be searched at high speed once they are in the memory, staging data into the memory can become a bottleneck in this type of system. For certain types of applications, such as table look-up and directory processing, the use of high-speed associative memory will result in an order-of-magnitude improvement in performance at relatively low incremental cost. Where there is little locality of reference, however, the potential cost benefit will not be realized.

IV. DATABASE MACHINES AND SOME ISSUES ON DBMS ARCHITECTURE, DATA MODEL AND DATA LANGUAGE DESIGNS RELATED TO DBMS STANDARDS

Having described the motivation, objectives, functionality, and challenges of the existing database machines, we shall now look into some of the issues of DBMS architecture, data model, and data language design from the viewpoint of database machines. Many of the issues discussed here have often been raised by researchers and practitioners. They are very relevant to the standardization of DBMS architectures, data models, and data languages.

A. DBMS architecture issues

DBM support of multi-schema architectures

The DBM technology could conceivably make those DBMS architectures which involve multiple schemas (e.g. the ANSI/SPARC architecture) very cost-effective. That is, it could have performance features that reduce the cost and complexity of the various schema mappings. The commitment to separate user views, logical data structure, and physical data structure stands on its own merits. It is not compromised by the fact that we are limited to von Neumann processors, disks, tapes, etc., today, and it should not be compromised by what happens tomorrow, particularly since we can make the separation increasingly economical through DBM technology. With respect to the standardization of the DBMS architecture, it cannot be stated categorically that DBM technology as such is going to push us toward a particular conceptual data model and external data models.
Rather, the DBM will probably support whatever is wanted as the "best" conceptual data model (by whatever criteria) and its mappings to external models and to the internal data model (including the internal data model of the DBM itself, the distribution among various mass storage devices, and distribution among geographically separated database systems). The internal data model is probably not "standardizable" because, first, it does not need to be: programs and end users do not see it or depend on it. Secondly, it must adapt to changing storage technologies, including the DBM, storage hierarchies, geographically distributed databases, etc. Therefore it is important to separate the internal schema from the conceptual schema and keep it flexible and extensible.

Mappings between external and conceptual schemas

The mapping between external and conceptual schemas may involve a subset mapping and a restructuring mapping. Subset mappings are necessary to provide privacy from unwanted queries, security from unwanted updating, and user convenience by removing all data that are not of concern to the user. Restructuring mappings are necessary to provide data structures that are convenient for user applications, and to support multiple user models and languages. A DBM can play an important role in implementing these mappings with efficiency and simplicity. It is possible to store and manipulate these schema descriptions on database machines as simply another database, where mappings are accomplished using queries to the schema descriptions. However, to do this, database machines must be capable of a more generalized pattern-matching capability for strings and sets.
This is necessary since these schema descriptions usually involve searching abstract or axiomatic (e.g., set-theoretic or predicate calculus) representations, rather than simply searching actual data instances. Ideally, the same hardware would be used for actual data and for both external and conceptual schema descriptions.

Mappings between conceptual and internal schemas

Some database machines can allow the storage structure of a database as defined by the internal model to be very similar to the structure defined in the conceptual model, and thus simplify the mapping process. For example, a relation in the community view can be stored and searched in an associative memory without the index tables, hash tables, pointer arrays, etc., commonly introduced in conventional systems. This means that any data stored on these machines require only the simplest of mappings to the internal schema. However, this does not necessarily mean that the internal schema of the entire database system will be simpler. In a large database system, an associative memory would probably be one of a whole hierarchy of memory devices, each featuring its own tradeoff between cost per bit and response time. If the associative memory is used and managed as yet another component in a large system, it could add some complexity to the overall internal schema. Instead, the architecture of the entire database system should be reexamined with database machines in mind. Their unique qualities can be exploited to simplify the overall system. The unique features of associative machines are fast response times and simple mapping between the conceptual and the internal schemas, but with a higher cost per bit than mass storage devices. The following three system functions seem appropriate for associative machines. One function is the direct storage of databases whose requirements for speed warrant a higher cost per bit.
A second function is to manage the mappings between the conceptual and internal schemas for databases stored on mass storage devices or for geographically distributed databases. The distribution of data among various mass storage devices or among geographically separated systems can be described and stored directly in associative machines as simply another database. Schema mappings can be implemented using queries to the internal and conceptual schema descriptions. Associative machines offer the potential for storage and querying of abstract representations. An internal schema that uses abstract representations, rather than involving actual data instances, has the potential advantages of a more compact description and one that requires no updating when updates are made to the actual data. A third function of associative machines is to act as a staging device for large blocks of mass storage. Most mass storage devices are accessed by location. Efficient use of these devices usually requires clustering of data into many large physical blocks, which is biased toward certain access paths. After queries to the internal schema (directories) have reduced the number of blocks involved in a retrieval to a small number, associative machines can then be used to further search these blocks.

B. Data model issues

Database machine support of data models

A DBM can be implemented to support any existing data model. For example, RAP, RARES, and DIRECT were designed specifically to support the relational model. The CASSM and INDY systems can support hierarchies as well as a subset of relational algebra operators and string pattern searches. The ASP system was designed to support a form of the network model. Although it was not compatible with the DBTG model, such an implementation should not present any major problems. Finally, any general-purpose backend computer can be programmed to support any or all of the models simultaneously.
The implementation in hardware of a single model does not preclude its being used to support other models. For example, a system that directly supports relations can be used to simulate hierarchical and network models. They can be implemented by setting aside items called "associative links" or "context pointers" in record occurrences (tuples) to store identification and structural data. Implementing hierarchies and networks requires the ability to implement "functional associations" between occurrences of record-types [52]. A record-type is analogous to a relation. A functional association can be defined as a 1:N (i.e., one-to-many) linkage or mapping between record occurrences of two relations. That is, if a 1:N linkage exists between relations A and B, then one record occurrence of A can be associated or linked with zero or more unique records of B. Each B record will have at most one A record for a particular association. An association or link is equivalent to a "set" in DBTG terminology. Restrictions on the application of functional associations between record-types determine whether the database schema is hierarchical or network. One way to implement an association is to allocate an item called ASSOC in the relation that acts as the domain of the functional association. This scheme is shown in Figure 7. The item ASSOC acts as the associative link. Each record occurrence must have one item whose value uniquely identifies the relation and each particular occurrence within the relation. This item will be called ID, for identification. For each record of B that is associated with one record of A, the record ID value of A is stored in the ASSOC item of B.
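The ID/ASSOC scheme just described can be sketched with hypothetical record occurrences; Python dicts stand in for tuples, and a list comprehension stands in for an associative selection:

```python
# Sketch of Figure 7's scheme: a 1:N association between record-types
# A and B, where each B occurrence carries its owner's ID in ASSOC.
# Record contents here are hypothetical.

A = [{"ID": "A1"}, {"ID": "A2"}]
B = [{"ID": "B1", "ASSOC": "A1"},   # B1 and B2 belong to A1
     {"ID": "B2", "ASSOC": "A1"},
     {"ID": "B3", "ASSOC": "A2"}]   # B3 belongs to A2

def members(a_id):
    """Traverse a DBTG-style 'set' as an associative selection on ASSOC."""
    return [b["ID"] for b in B if b["ASSOC"] == a_id]

def owner(b_id):
    """The inverse direction: at most one A record per B record."""
    assoc = next(b["ASSOC"] for b in B if b["ID"] == b_id)
    return next(a["ID"] for a in A if a["ID"] == assoc)
```

Because the linkage is carried as ordinary item values, no pointer chains are needed; both directions reduce to content searches.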
Finding the records of B associated with a particular record of A, or vice versa, is simply a matter of using the associative cross selection or join instructions which interrelate two relations through comparable ID and ASSOC values. A second way to associate records of the same or of different types is to create a new linking relation which contains two (or more) ID items, one for each record-type. This relation, called LINK, associates one record of A with one record of B by storing the associated IDs of the two records in one occurrence of LINK. This scheme has the advantage of implementing M:N, or many-to-many, associations between record-types. An example is shown in Figure 8.

Figure 7-Implementing associations with relations: a) 1:N association between record-types A and B; b) record-types with associative link fields; c) example record occurrences.

Figure 8-Implementing associations with a LINK relation: a) M:N association between record-types A and B; b) A and B record-types with LINK relation; c) example record occurrences.

It should be noted that only "information carrying" associations need be implemented with links. All other relationships, which can be derived directly from the values in the records, can be handled directly through the associative cross selection or join instructions of relational DBMs. Of the three data models, the relational model is the most general in terms of the types of associations it can represent.
It also requires the smallest number of basic or primitive operations to implement a relationally complete data manipulation instruction set. Also, its simple record structure and orientation toward sets-of-records operations make it a natural candidate for DBM implementation. From the above comments, it may appear that the relational model may be the easiest to implement and result in the best performance. However, we must be careful about jumping to conclusions. Many of the additional features of hierarchical and network models were proposed because of the need to improve transaction processing performance. The same techniques that have served software implementations will likely serve hardware as well. Also, a user's application may better lend itself to hierarchical or network modeling. In such cases, hierarchical or network hardware will probably out-perform relational hardware that uses software and data to simulate other models' primitives. Also, many transaction applications do not require complex searches, nor are the sets of records to be processed large. In fact, today's online transaction processing applications are dominated by large numbers of concurrent transactions requiring relatively simple search and update interactions. These types of operations are the least likely to take advantage of the set-oriented associative processing capabilities of relational or set-theoretic DBMs. Of course, a major reason why existing computerized database applications predominantly require simple searches and updates is that an adequate implementation of more complex models is not available. Judging from existing examples, the DBM will very likely make the more advanced conceptual data models (e.g.
relational or set-theoretic) more feasible to implement, whereas today they are frequently judged too complex to implement efficiently as a general-purpose system for a broad base of applications. Thus we should be able to choose a standard model based on user benefits and assume with confidence that the performance gap will gradually close.

C. Data language issues

We now turn to data languages, collectively consisting of all languages for directly manipulating database data on behalf of application programs or end-users. Thus data languages include data sublanguages, which are extensions to conventional programming languages, and self-contained languages (such as query languages, report generators, "query by example" and other end-user interfaces). Data sublanguages in particular are the target of standards efforts because of the need to protect the user community's investment in computer programs that use these interfaces. Any practical standard takes into consideration user requirements, e.g., proper functionality and ease of use, and feasibility: is there a reasonably efficient, economical implementation of the proposed interface? The feasibility condition creates tension in times of rapid technological innovation, when the ground rules for judging what is possible or economical are subject to radical change. This appears to be the case for data languages, not only because of DBM development, but also in view of the slow but steady trend toward hierarchies of storage and geographically distributed data processing. The following paragraphs tell this story: The bad news is that the ability to improve price/performance through technology is very sensitive to the character of the data language. The good news is that we can predict well in advance what features data languages must have to fully exploit emerging technology. Furthermore, there is a strong indication that these same features are desired by the user community independent of technology considerations.
If so, then the standards makers have their work cut out for them.

High level vs. navigational data languages

As will be seen, the underlying technical considerations generally motivate the development of very high level data languages, by which we mean languages in which the user/programmer expresses to the database system what results are expected instead of, or in addition to, how the results are to be obtained. With regard to high level data languages it must be recognized that:

-"high level" and "low level," like "procedural" and "non-procedural," are relative terms;

-self-contained languages are not the only languages that can be high level or non-procedural. There is no intrinsic reason why a data sublanguage cannot be high level even if the programming language in which it is embedded is low level. See, for example, the use of ALPHA in [8].

Whereas everyone appears to agree that end-user oriented languages should be high level, there is an ongoing controversy concerning whether high level data sublanguages are desirable. On the one side are those who argue that programmers should have relatively low level facilities so that they can fine-tune performance tradeoffs. The other side contends that in an era of increasing programming costs and decreasing hardware costs it is best to optimize programmer productivity through the use of high level facilities and let the system worry about efficient hardware utilization. Technology trends, and the DBM in particular, strongly support the latter position. We will briefly examine some reasons for this.

A database machine can sometimes be "tightly coupled" to the hardware which makes use of it. For instance, a mainframe manufacturer could develop a backend which is enclosed within the host itself and communicates with main memory through a very high speed bus. Or a multifunction terminal might be plugged directly into a small "query machine." In such cases there is no concern that communication with the DBM will be a performance bottleneck. But suppose that the DBM is not developed by the host manufacturer, or that it is designed to serve multiple hosts. Or suppose that a DBM is required to communicate with remote hosts in a network, or even with other DBMs to support a distributed database (Figure 9). The need to do all of these is bound to arise, so the DBM developer must evaluate the response and throughput implications of loosely coupling the DBM to an external I/O interface or an even slower telecommunications channel.

Figure 9-DBM nodes in a distributed environment.

There is nothing in the ten-year picture to suggest that the price/performance penalty for loose coupling will go away (otherwise the economic argument for distributed data processing would lose most of its force). The DBM developer is therefore motivated to minimize the amount of data that must go in or out of the DBM in order to get a user's job done. He must also strive to minimize the number of separate messages, large or small, to reduce the communication burden. All of this has a direct bearing on the data language available to the user. In the extreme case, if the user can express his job in a single data language statement, and if that statement can be directly interpreted by a DBM, then obviously the communication overhead has been reduced as much as is possible.
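The contrast between a single set-oriented statement and a record-at-a-time program can be illustrated with a toy host/DBM boundary. The account fields and the call counting below are hypothetical, chosen only to make the message traffic visible:

```python
# Hypothetical account records. Each returned "calls" value counts
# host<->DBM interactions, the quantity a loosely coupled DBM developer
# wants to minimize.

def mark_inactive_set(accounts, cutoff):
    """Set-oriented: the whole update is one 'call'; qualifying records
    never cross the host/DBM boundary."""
    calls = 1                                  # one data language statement
    for a in accounts:                         # executed inside the DBM
        if a["last_posting"] < cutoff:
            a["status"] = "inactive"
    return calls

def mark_inactive_navigational(accounts, cutoff):
    """Record-at-a-time: one retrieve call per record examined, plus one
    store call per record modified, all crossing the boundary."""
    calls = 0
    for i, a in enumerate(accounts):
        rec = dict(a); calls += 1              # retrieve call
        if rec["last_posting"] < cutoff:
            rec["status"] = "inactive"
            accounts[i] = rec; calls += 1      # store call
    return calls
```

Both routines produce the same database state; only the number of boundary crossings differs, and it is the navigational version whose traffic grows with the number of records.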
If, in contrast, the job must be decomposed by the user into a program with several lower-level sublanguage statements, possibly executed in a loop, then the number of messages and the amount of data transferred will increase dramatically. For example, suppose the user is to mark "inactive" all posted accounts for which there have been no debits or credits during the last twelve months. Given a powerful data language capable of dealing with entire sets of data, this transaction can be expressed with a single statement: a single "call" to the database system and no database records transferred. Given a record-at-a-time ("navigational") data language, there would be at least two calls to the system for each inactive account, one to retrieve the record and the other to store the modified version. There are halfway measures which preserve the navigational nature of low-level data languages but would still reduce some of the DBM interaction. For instance, high-level intention declarations are a possibility (Lowenthal [34]). If, in the above example, the user could state in some fashion, "I intend to update all accounts for which there have been no debits or credits posted during the last twelve months," then the system could subsequently buffer blocks of multiple records between the host and DBM, but move one record at a time to or from the user's program. This wouldn't reduce the amount of database data transferred, but it would cut down the number of messages between the host and DBM (each message would be longer). This technique is useful when sequential treatment of data is ultimately unavoidable by any means, such as when a program is required to produce a list of the accounts that have been marked inactive. Consider another method of capturing the high-level meaning of an operation expressed in a low-level data language.
Suppose that the results to be obtained are such that the programmer can write a special kind of subroutine in which the only data referred to are the parameters, the database data retrieved in the subroutine, and some constants established in the subroutine. This subroutine does not refer to global (common) data, does not read or write non-database files, and does not call other subroutines. Given such constraints, it is feasible to transfer the entire subroutine to the DBM as a single operation, either in source or object form. The DBM can perform internal retrievals, returning only the subroutine's output to the host. Using the above example, a subroutine X would be catalogued (in the DBM) which retrieves each qualifying account and stores it back with the "inactive" indicator set. The only interaction between the host and the DBM is the command to execute X and the status returned upon completion. An additional benefit of this approach is the opportunity for the DBM to optimize the execution of X, since it "sees" the entire collection of database operations instead of individual data language statements. CASSM is an example of a DBM which supports catalogued subroutines. There are several data language features that could be included if the aim were to minimize communication. Most of these motivate or force the user to express at a high level what is to be ultimately accomplished. They cause the language to be less procedural, or supplement procedural sequences with non-procedural declarations. We point out in passing that the cost of inter-task communication in a typical mainframe operating system is surprisingly high, so that even in a conventional software database environment there is a strong motivation to reduce the traffic between the application task and the database task.
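The catalogued-subroutine idea can be sketched with a toy class standing in for the DBM. The class name, the status reply format, and the example subroutine are all hypothetical; CASSM's actual mechanism differs:

```python
# Sketch: the host ships a self-contained subroutine to the DBM once,
# then triggers it with a single command; all retrievals and stores
# happen inside the DBM, and only a status reply crosses the boundary.

class ToyDBM:
    def __init__(self, records):
        self.records = records    # the database, held inside the DBM
        self.catalog = {}         # catalogued subroutines, by name

    def catalogue(self, name, subroutine):
        """Transfer the subroutine to the DBM as a single operation."""
        self.catalog[name] = subroutine

    def execute(self, name, *params):
        """One host command; the subroutine's output is all that returns."""
        count = self.catalog[name](self.records, *params)
        return {"status": "ok", "updated": count}

def mark_inactive(records, cutoff):
    """Refers only to its parameters and the database data it retrieves."""
    n = 0
    for r in records:
        if r["last_posting"] < cutoff:
            r["status"] = "inactive"; n += 1
    return n

dbm = ToyDBM([{"last_posting": 2023, "status": "active"},
              {"last_posting": 2025, "status": "active"}])
dbm.catalogue("X", mark_inactive)
reply = dbm.execute("X", 2024)
```

Because the DBM sees the whole subroutine rather than one statement at a time, it is also free to reorder or batch the internal retrievals.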
Another independent motive is fueled by the advent of hierarchies of storage, which are inevitable if very large databases are to be addressed in the context of foreseeable price/performance trends for different types of secondary storage; no single device is expected to emerge both cheaper and faster than every other device (see Figure 10). It has been argued that storage hierarchies will be more effective if the data staging algorithm can anticipate in advance exactly what data will be required [34]. This again relies on a language through which the user can express his data needs with some refinement. High-level, set-oriented statements, intention declarations, and the like would all marry quite well with an intelligent data staging mechanism. Thus it is the broad direction of computer technology, encompassing distributed processing, storage hierarchies, and software engineering, and not just the DBM, which calls for a reassessment of data language standards efforts.

Set oriented vs. record oriented processing

The DBM concept most directly and vividly exposes the relationship between a data language and the hardware mechanism which ultimately does the work. In previous sections it has been established that conventional computer architecture is not particularly well suited for database management, and that dramatic improvements in cost/performance can be achieved with fundamentally new approaches. In nearly every proposed architecture, be it oriented to searching, sorting, list merging or the like, there is a common theme: one or more sets of data are operated upon to produce another set. This is no accident, since the basis for the claimed economy is parallel processing, that is, many small inexpensive processors working effectively together to do a large job quickly. The opportunity to exploit parallelism practically depends on the ability to define operations in terms of sets instead of individual points of data.
This in turn clearly depends on the ability to deal with sets of data at the level of the data language itself.

From the collection of the Computer History Museum (www.computerhistory.org)

National Computer Conference, 1980

[Figure 10 - Trends in online storage: future product directions. The chart positions devices such as main memories (360/165, 3033), the 2305, 2314, 3330-11 and 3350 disks, CCDs, bubbles, and the 3850 mass store by capacity and accesses per second.]

In the world of scientific computing, scalar oriented languages like FORTRAN have been enhanced with high level array operations so that, for example, matrix inversion or multiplication can be expressed as a single statement. This enhancement is motivated not so much by software engineering principles as by the industry's ability to build highly parallel machines that operate on arrays at blinding speeds. If matrix multiplication can only be expressed as a sequence of DO, IF and assignment statements, how can the underlying system figure out what the programmer intended? How can the advanced architecture be exploited? Likewise, if a database programmer cannot express a predicate as a predicate ("find all accounts for which no credits or debits have been posted during the last 12 months" paraphrased as a single data language statement), but must restate it procedurally with more primitive record oriented statements embedded in loops, how can set oriented DBM's like RAP, RARES, or CASSM be effectively exploited? In the past, set oriented data languages, sometimes (incorrectly) called "relational" languages, have been regarded as powerful but impractical-too expensive to implement and operate. The lower level record oriented languages, including the CODASYL DML, have scored high points for feasibility and economy. The emergence of DBM technology may actually reverse this situation in the next few years.
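The contrast can be sketched in miniature. The fragment below is plain Python rather than any actual data language, and the account file is hypothetical; it states the same qualification once as a record oriented loop and once as a single predicate over the whole set, the form a parallel search engine could evaluate against every record at once.

```python
from datetime import date, timedelta

# Hypothetical account file; last_posting is the date of the most recent
# credit or debit posting (None if nothing was ever posted).
accounts = [
    {"number": 101, "last_posting": date(2024, 11, 3)},
    {"number": 102, "last_posting": date(2023, 1, 15)},
    {"number": 103, "last_posting": None},
]

today = date(2025, 1, 1)                  # fixed "today" for the example
cutoff = today - timedelta(days=365)

# Record oriented paraphrase: primitive statements embedded in a loop,
# whose overall intent the underlying system cannot easily recover.
inactive_loop = []
for acct in accounts:
    if acct["last_posting"] is None or acct["last_posting"] < cutoff:
        inactive_loop.append(acct["number"])

# Set oriented paraphrase: one predicate applied to the set as a whole.
inactive_set = [a["number"] for a in accounts
                if a["last_posting"] is None or a["last_posting"] < cutoff]

print(inactive_loop, inactive_set)        # [102, 103] [102, 103]
```

The two produce identical answers; the difference is that the second form exposes the predicate to the system, while the first buries it in control flow.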
In view of this, language developers working with the CODASYL basis should work out ways of enhancing the DML with set oriented operations. Not only will this result in a better fit with the DBM; it will also fit better with the trends in user requirements (people productivity), mass storage technology and distributed databases. There is an obvious counterargument. If users rarely need to manipulate matrices, then fancy scientific computers should be built for the few and FORTRAN for the masses shouldn't be affected. Likewise, if very few users need to manipulate sets of data, but rely mainly on sequential access or simple direct access ("find the unique account record with key account number 745286"), then set oriented machines will not have broad appeal. We strongly believe that although there will always be a need for record oriented access to data, there is also a great demand for set oriented capabilities. Moreover, this demand can only increase as databases come to be regarded as information resources for management.

V. SOME TECHNICAL ISSUES ON DATABASE MACHINES

The following is a collection of key technical issues which must be addressed by researchers in database machine technology. The discussion is broadly grouped into three areas: basic technology, hardware architecture, and software architecture.

Basic technology

The use of the systems described in Section III will depend heavily on the cost, performance, storage capacity, and reliability of such solid-state devices as LSI processors, RAMs, CCDs, and bubbles. DBM architects will be structuring systems which incorporate such large volumes of these devices that reliability will dominate the design of products. Researchers are only beginning to realize that solid-state devices are not just "electronic" disks.
Bubbles and CCDs provide unique opportunities for combining logic with storage, as demonstrated in IBM's bubble query machine, RAP.2, etc. The main manufacturing problems for research and development are:

1) High density storage media

Texas Instruments introduced in 1976 the TIB 0101 bubble chips with 10^5 bits/chip at 10^6 bits/in^2 density (6 μm bubble diameter), and in 1978 the TIB 0103 bubble chips with 2.56 x 10^5 bits/chip at 4 x 10^6 bits/in^2 density (3 μm bubble diameter). A simple 4" x 4" board containing a 1 Mb bubble memory module as well as all supporting semiconductor components has already appeared [25]. Research work on 1 μm and even 0.5 μm bubble diameter materials (potentially up to 10^8 bits/in^2 density) has been reported by IBM Research. The manufacturers must get ready to build devices using such materials. Investigators will continue their search for materials sustaining even smaller bubbles. Alternatively, engineers may invent and implement device structures capable of higher densities (e.g. bubble lattices) than conventional structures (e.g. the half-disk types used in the TIB 0303) at the same bubble diameter. Similar advances in design are taking place in LSI semiconductor devices. One example is TI's three-dimensional MOS RAM cell design of 1978, which reduces area, power, and refresh requirements. Also, several new semiconductor materials are being explored, such as gallium arsenide, that reduce area and power requirements.

2) High resolution lithography

Bubble chips entered the market using high-resolution photolithography (in fact, close to the limit of its capability). Electron beam lithography will reduce the line width by at least another order of magnitude. When used with small-bubble materials or various semiconductor devices, it will enable bit density increases of two orders of magnitude. Again, clever device structure (e.g.
contiguous disks or three-dimensional MOS devices) achieves higher device density at a given lithography capability, thus providing an alternative to high-resolution lithography.

3) Packaging

Packaging considerations can have a large impact on cost, speed, and reliability. All three have been, and will continue to be, substantially improved by putting more devices on a chip. Improvements in device design, better yields allowing larger chips, and higher resolution lithography are increasing the number of devices on a chip at a rate that is difficult to comprehend. However, exploiting this requires equally drastic architectural approaches to ensure that the number of LSI chips is minimized. The simpleminded approach of integrating more of a conventional architecture on a chip usually increases the number of pins per chip beyond cost-effective technological limits (currently about 40 pins per chip). Two approaches can be taken to improve the situation. One is to reduce the cost of additional pins per chip. The other is to reduce the number of pins per chip through a different architectural approach. Many improvements have been made or proposed to reduce the cost of additional pins per chip. Gang bonding and film carrier techniques allow more of the packaging of chips to be automated, with improved reliability. Also, putting multiple chips on a single substrate can reduce the cost of packaging. Another technique, wafer-scale integration (WSI), can potentially avoid much of the packaging cost by interconnecting the chips directly on the original wafer. Bad chips are removed by laser trimming or by dynamic diagnostic algorithms that locate and electronically disconnect them. The dynamic approach has the advantage that it can be applied to remove chips that go bad in installed equipment.
Alternatively, new architectures can cluster hardware onto chips in ways that reduce the number of pins per chip and simplify the interconnection among chips. The cellular-logic devices described in Section III use a one-dimensional array, a tree, or a network. A one-dimensional array requires the fewest pins per cell because each cell need only communicate with its two adjacent cells. Also, the number of pins per chip is independent of the number of cells per chip. This allows the drastic increase in devices per chip to be directly exploited without increasing the number of pins per chip. For example, if one cell per chip requires 16 pins, then 100 cells per chip would still require only 16 pins. This advantage also carries over to larger packages, such as printed circuit boards, multiple-chip packages, and wafer-scale integration. No other topology has this property; all others must increase the number of pins per chip as more cells are integrated into one chip. In order to exploit this advantage, however, the memory and processor of each cell must use compatible technologies, so that they can be packaged (or preferably processed) together. Various semiconductor memory technologies have very compatible logic technologies. Also, magnetic bubble logic shows great promise for exploiting bubble memories. Disc and tape memories, however, have no compatible logic technologies. The industry has already paid attention to board compatibility and voltage compatibility of bubble components with semiconductor components. Some remaining problems for bubbles with major improvement potential are multiple-chip packaging, replacement of external bias magnets by on-chip bias, replacement or simplification of the external driving coils, and further development of bubble logic.

4) System innovation

The hardware problems are reasonably well defined and being pursued. The system problems are desperately in need of innovation, discipline, and interaction with hardware know-how.
There have been enough scattered conceptual explorations of bubble device capabilities (e.g., a variety of device structures for Boolean logic, text editing, data management, sorting, associative search, etc.). Evaluation of the feasibility of these devices is lacking. No serious commercial impact is foreseen without the development of a few (indeed very few) basic chip types encompassing a collection of universal functions. System assessments are equally lacking. Detailed designs, including system performance evaluation and software requirements, are needed to demonstrate the advantages of the innovative hardware designs. As usual, a multi-disciplinary area tends to become a no-man's land. Only simple problems, such as simulation and performance evaluation of bubbles and CCD's as gap fillers, have been examined, and probably overworked. Tomorrow's DBM's will depend heavily on both loosely and tightly coupled inter-processor architectures. Communication considerations will begin to dominate price and performance. Realization of DBM architectures will depend heavily on progress in this area. The design of special purpose LSI devices to fit DBM idiosyncrasies will depend heavily on cutting design and engineering costs for such devices. If costs continue to run high, DBM implementors will have to structure their thinking toward utilization of more conventionally organized memory and microprocessor components.

5) Technology and standardization

Standardization usually comes after product development, not before. However, in the age of very large scale integration (VLSI), when design cost overshadows manufacturing cost (e.g., see Moore [39]), it would make great sense for the users to indicate what they want to see in the hardware.
By adjusting their requirements to the manufacturing constraints of hardware, they may anticipate standards before product development, both for user convenience and for manufacturing cost reduction. Let us clarify the issues by considering a specific technology: magnetic bubbles. At present, bubble memory modules with capacities ranging from 92 kb to 1 Mb are available commercially. Certainly, the technology is mature enough to consider standardization issues. In the U.S.A., bubble products are marketed by Texas Instruments, INTEL, Rockwell International, and National Semiconductor, and are also produced by Western Electric and other companies for internal use. In Japan, Fujitsu, Hitachi, and NEC are manufacturing bubble modules as commercial products (see Yamagishi [53]). Certainly, there are enough manufacturers to make standardization issues relevant and urgent from the user's viewpoint. Moreover, steady improvements of device density and chip capacity have been predicted, and various functional enhancements have been proposed. Certainly, the technology will undergo highly dynamic evolutionary stages and will need standardization to prevent unbridled development. The maturity of manufacturing technology will encourage the pursuit of associative search, sorting, data management, simple Boolean logic, etc. (see Chang [63]). Although the detailed device configurations must await the gradual hardware evolution, the terminal characteristics of the chips, which are what concern the users, could be made responsive to user needs, and early interactions between the manufacturers (or their forerunners, the researchers and developers) and the users will be worthwhile. Some proposals for standardization may be a reasonable way to initiate the dialogues.

Hardware architecture

1) Clearly, the proper mix of families of device architectures and speeds will be a major concern of DBM technologists in the '80's.
Because of the expense of prototyping such systems, there will be a heavy reliance on modeling and performance evaluation simulations.

2) The need to define logical interfaces and protocols for I/O architectures will become a dominant theme in the '80's [38]. This will be required so that systems can more easily incorporate various DBM components into integrated systems to meet user application needs. One can anticipate the same controversies arising in this area as have occurred in communication and networking standardization efforts.

3) The success of category 1 and category 4 DBMs will depend heavily on being able to optimize their usage in broad application environments. For example, they appear to be most cost effective where searching requires that complex relationships be satisfied on secondary keys and where multiple records respond to such requests. This feature is expected to become more important in the future, when applications are hypothesized to rely heavily on on-line queries. Nevertheless, these devices will have greater applicability if they can also efficiently search for single records. The ability to handle many data types of varying lengths would also broaden their market.

4) The protection mechanisms required by databases to control concurrency, security, integrity, and recovery have barely been considered by workers in DBM technology. This is often passed off as a software problem. A fruitful area for DBM researchers will be designing DBM architectures to support these functions. The inherent speed of associative processors indicates that enforcement of protection rules may become one of their primary functions.

Software architecture

1) Because database machines will incorporate many diverse processors, bulk memories, and intelligent memories with varying price, performance, and capacity, an extensive amount of work will continue to be needed in studying data clustering, partitioning, staging, and virtual memory strategies for files.
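A toy model can make the staging question in point 1 concrete. The sketch below is a minimal two-level hierarchy (the Hierarchy class, its block layout, and the operation counts are hypothetical illustrations, not any proposed DBM interface): when a request declares the whole set of blocks it will touch, the staging mechanism can schedule one bulk transfer instead of faulting block by block.

```python
# Toy two-level storage hierarchy. A record-at-a-time workload faults one
# block at a time, while a set-oriented request names the whole set of
# blocks up front so staging can be done as one anticipated bulk transfer.
# All names and sizes here are illustrative, not any proposed DBM interface.

class Hierarchy:
    def __init__(self, disk_blocks):
        self.disk = dict(disk_blocks)     # slow level: block id -> records
        self.cache = {}                   # fast level
        self.stage_ops = 0                # staging operations performed

    def fetch(self, block_id):
        # Demand staging: each miss triggers its own transfer.
        if block_id not in self.cache:
            self.cache[block_id] = self.disk[block_id]
            self.stage_ops += 1
        return self.cache[block_id]

    def stage_set(self, block_ids):
        # Anticipatory staging: the declared set is brought up in one pass.
        missing = [b for b in block_ids if b not in self.cache]
        for b in missing:
            self.cache[b] = self.disk[b]
        if missing:
            self.stage_ops += 1

disk = {b: list(range(b * 10, b * 10 + 10)) for b in range(8)}

demand = Hierarchy(disk)
for b in range(8):                        # eight separate demand faults
    demand.fetch(b)

anticipated = Hierarchy(disk)
anticipated.stage_set(range(8))           # one scheduled bulk transfer

print(demand.stage_ops, anticipated.stage_ops)    # 8 1
```

Both hierarchies end up caching the same blocks; the set-oriented declaration simply lets the staging mechanism batch the work, which is the property an intelligent hierarchy would exploit.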
Magnetic disks are not likely to disappear in the '80's. Also, other low price/bit large file technologies may come of age in the '80's, e.g., laser video disks and EBAM. They will be used to store the majority of on-line data. Accessing strategies will continue to optimize resources by attempting to minimize the number of disk accesses required to complete an operation. Algorithms that use intelligent controllers and associative memories will be sought to improve access to these bulk memories.

2) An important contribution that is needed to unify database machine research will be the identification of commonality and compromise among the individual requirements of text, formatted files, signal, graphic, and map databases.

3) An important issue raised in the past is whether or not database machines should be user programmable. That is, should software be provided to allow users to code data processing and systems programs, or should the system limit itself to the execution of database management functions? Precluding the ability to run machine or compiled code will eliminate many of the mechanisms or avenues that allow database security and integrity breaches today. It will also increase the designer's degrees of freedom in customizing the DBM for its intended function.

4) The collection and dissemination of user statistics relating to query complexity, file characteristics, locality of database access, etc., are currently non-existent. Without such data, researchers can only hypothesize the relative importance of various architectural tradeoffs. We cannot deliver good solutions until the problems are well understood and parameterized. On the other hand, we cannot parameterize user statistics until we deliver good solutions. Users adapt to whatever system is available.
Any statistics gathered from existing systems are valid only for the past and may bear no resemblance to the future. Improvement must therefore be made iteratively: because of improvements in hardware, new and improved system strategies will be developed and used, and these will, in turn, provide feedback to aid further hardware improvements.

VI. CONCLUSION

What impact do hardware technologies and database machines have on the database management area? The answer is: they are all making data processing less expensive and more accessible (to both large and small users). The low-cost computational, logic and control capabilities have already made microprocessors ubiquitous. Bubbles and CCD's offer modular storage coupled with data storing, arranging and managing capabilities. Their impact will be twofold: first, they will extend database management capabilities to smaller data collections for smaller users in smaller machines. Second, they will be useful in large database systems as nodes in a network, as servers, and as components amenable to parallel operations. Advances in database machine technology will be required to solve many database management system problems so that the promise of the database gospel can be delivered to users. Progress toward producing these machines will depend heavily on improvements in the price/performance of basic memory and processor technologies. A better understanding of the partitioning of the total problem will also aid special device development. The trend will be toward defining integrated database machines. Thus, workers in this area will find it necessary to have a good understanding of database application and software issues, as well as of hardware architecture and technology issues. The advances in DBM technology will not only have great impact on the implementation of DBMS software but will also have a profound effect on the designs of DBMS architectures, data models, and data languages. Database machines can make it very cost-effective to support high-level data models and data languages, which are necessary for improving user/programmer productivity, and to support multi-schema DBMS architectures, which are necessary for achieving data independence. The existing database machines have demonstrated their capabilities to make data mapping between schemas a simpler task and to support the existing data models with considerable improvement in cost/performance. Furthermore, database machines are particularly suitable for supporting high level, non-procedural, and set oriented data languages. Thus, we should establish a standard DBMS architecture or a data model based on user benefits and assume with confidence that the performance gap will gradually close. High level, non-procedural and set oriented operations, which score high in both user productivity and technology considerations, should be incorporated in a standard data language.

REFERENCES

1. Banerjee, J., Hsiao, D. K., and Kannan, K., "DBC-A Database Computer for Very Large Databases," IEEE Transactions on Computers, Vol. C-28, No. 6, June 1979.
2. Batcher, K. E., "STARAN Series E," Proc. 1977 International Conference on Parallel Processing, Aug. 1977, pp. 140-143.
3. Baum, R. I., Hsiao, D. K., and Kannan, K., "The Architecture of a Database Computer-Part I: Concepts and Capabilities," The Ohio State University Technical Report No. OSU-CISRC-TR-76-1, September 1976.
4. Berra, P. B. and Oliver, E., "The Role of Associative Array Processors in Data Base Machine Architecture," Computer, Vol. 12, No. 3, March 1979.
5. Canady, R. H., Harrison, R. D., Ivie, E. L., Ryder, J. L., and Wehr, L. A., "A Back-End Computer for Database Management," Communications of the ACM, Vol. 17, No. 10, October 1974, pp. 575-582.
6. Chang, H., Magnetic Bubble Memory Technology, Marcel Dekker, 1978.
7. Chang, H., "On Bubble Memories and Relational Data Base," Proc. 4th Int'l Conf. on Very Large Data Bases, Berlin, Sept. 13-15, 1978, pp. 207-229.
8. Codd, E. F. and Date, C. J., "Interactive Support for Non-Programmers: The Relational and Network Approaches," IBM Research publication RJ1400, San Jose, June 1974.
9. Computer, Vol. 12, No. 3, March 1979.
10. Copeland, G. P., "String Storage and Searching for Data Base Applications: Implementation on the INDY Backend Kernel," Proc. Fourth Workshop on Computer Architecture for Non-Numeric Processing, SIGARCH/SIGIR/SIGMOD, Aug. 1978, pp. 8-17.
11. Copeland, G. P., Lipovski, G. J., and Su, S. Y. W., "The Architecture of CASSM: A Cellular System for Non-numeric Processing," Proc. 1st Annual Symposium on Computer Architecture, Dec. 1973, pp. 121-128.
12. Davis, E. W., "STARAN Parallel Processor System Software," AFIPS Conf. Proc., Vol. 43, 1974 NCC, pp. 16-22.
13. DeFiore, C. and Berra, P. B., "A Data Management System Utilizing an Associative Memory," AFIPS Conf. Proc., Vol. 42, 1973 NCC, pp. 181-185.
14. DeFiore, C. R. and Berra, P. B., "A Quantitative Analysis of the Utilization of Associative Memories in Data Management," IEEE Trans. Computers, Vol. C-23, No. 2, 1974, pp. 121-132.
15. DeWitt, D. J., "DIRECT-A Multiprocessor Organization for Supporting Relational Data Base Management Systems," IEEE Transactions on Computers, Vol. C-28, No. 6, June 1979, pp. 395-406.
16. Fisher, P. S. and Maryanski, F. J., "Design Considerations in Distributed Data Base Management Systems," TR CS 77-08, Dept. of Computer Science, Kansas State University, Manhattan, Kansas, April 1977.
17. Freen, R., "A Partitioned Data Base for Use With a Relational Associative Processor," M.S. Thesis, Department of Computer Science, University of Toronto, December 1977.
18. Hakozaki, K., et al., "A Conceptual Design of a Generalized Database Subsystem," Proc. 3rd Int'l Conf. on Very Large Data Bases, Oct. 1977, pp. 246-253.
19. Housh, R. D., "A User Transparent Distributed DBMS," Masters Report, Dept. of Computer Science, Kansas State University, Manhattan, Kansas.
20. Hsiao, D. K. and Kannan, K., "The Architecture of a Database Computer-Part II: The Design of Structure Memory and Its Related Processors," The Ohio State University, Tech. Rep. OSU-CISRC-TR-76-3, December 1976.
21. Hsiao, D. K. and Kannan, K., "The Architecture of a Database Computer-Part III: The Design of the Mass Memory and Its Related Components," The Ohio State University, Tech. Rep. OSU-CISRC-TR-76-3, December 1976.
22. Hsiao, D. K., Kannan, K., and Kerr, D. S., "Structure Memory Designs for a Database Computer," Proceedings of ACM 77, October 1977.
23. Hsiao, D. K., Kannan, K., and Kerr, D. S., "Structure Memory Designs for a Database Computer," Proc. ACM 1977, Dec. 1977, pp. 343-350.
24. IEEE Transactions on Computers, Vol. C-28, No. 6, June 1979.
25. INTEL Corp., "INTEL Magnetics Bubble Memory Design Handbook," May 1979.
26. Jeffery, S. and Berg, J. L., "Developing a Strategy for Federal DBMS Standards," Tenth Annual Conf., Society for Management Information Systems, Washington, D.C., Sept. 18-20, 1978.
27. Jeffery, S., Fife, D., Deutsch, D., and Sockut, G., "Architectural Considerations for Federal Database Standards," Spring COMPCON 79, San Francisco, Calif., Feb. 26-March 1, 1979.
28. Kannan, K., Hsiao, D. K., and Kerr, D. S., "A Microprogrammed Keyword Transformation Unit for a Database Computer," Proceedings of the MICRO-10 Conference, October 1977.
29. Kuck, D. J., "ILLIAC IV Software and Application Programming," IEEE Transactions on Computers, Vol. C-17, No. 8, August 1968.
30. Lin, C. S., Smith, D. C. P., and Smith, J. M., "The Design of a Rotating Associative Memory for Relational Data Base Applications," ACM Trans. Database Systems, Vol. 1, No. 1, 1976, pp. 53-65.
31. Linde, R., Gates, R., and Peng, T. F., "Associative Processor Applications to Real-time Data Management," AFIPS Conf. Proc., Vol. 42, 1973, pp. 187-195.
32. Lipovski, G. J., "Architectural Features of CASSM: A Context Addressed Segment Sequential Memory," Proc. 5th Annual Symposium on Computer Architecture, Palo Alto, Calif., April 1978, pp. 31-38.
33. Lowenthal, E. I., "The Backend Computer, Part I and Part II," Auerbach (Data Management) Series, 24-01-04 and 24-01-05, 1976.
34. Lowenthal, E. I., "A Survey: The Application of Data Base Management Computers in Distributed Systems," Proc. 3rd Int'l Conf. on Very Large Data Bases, Tokyo, October 1977.
35. Madnick, S. E., "INFOPLEX-Hierarchical Decomposition of a Large Information Management System Using a Microprocessor Complex," Proc. 1975 NCC, Vol. 44, AFIPS Press, Montvale, N.J., pp. 581-586.
36. Marill, T. and Stern, D., "The Data Computer-A Network Data Utility," 1975 NCC, Vol. 44, June 1975.
37. Maryanski, F. J. and Wallentine, V. E., "A Simulation Model of a Backend Data Base Management System," Proc. 7th Pittsburgh Symposium on Modeling and Simulation, April 1976, pp. 252-257.
38. McDonnell, K., "Trends in Non-Software Support for Input-Output Functions," Proc. 3rd Workshop on Computer Architecture for Non-Numeric Processing, May 1977, pp. 40-47.
39. Moore, G., "VLSI: Some Fundamental Challenges," IEEE Spectrum, Vol. 16, No. 4, April 1979.
40. Moulder, R., "An Implementation of a Data Management System on an Associative Processor," AFIPS Conf. Proc., Vol. 42, 1973 NCC, pp. 171-176.
41. Ozkarahan, E. A., Schuster, S. A., and Smith, K. C., "RAP-An Associative Processor for Data Base Management," AFIPS Conf. Proc., 1975 NCC, pp. 370-387.
42. Rosenthal, R. S., "An Evaluation of a Backend Data Base Management Machine," Proc. Annual Computer Related Information Systems Symposium, U.S. Air Force Academy, 1977.
43. Rudolph, J. A., "A Production Implementation of an Associative Processor: STARAN," AFIPS Conf. Proc., 1972 FJCC, Vol. 41, Part I, pp. 229-241.
44. Schuster, S. A., Ozkarahan, E. A., and Smith, K. C., "A Virtual Memory System for a Relational Associative Processor," Proc. Nat. Computer Conf., 1976, pp. 855-862.
45. Schuster, S. A., Nguyen, H. B., Ozkarahan, E. A., and Smith, K. C., "RAP.2-An Associative Processor for Databases and Its Applications," IEEE Transactions on Computers, Vol. C-28, No. 6, June 1979, pp. 446-458.
46. Slotnick, D. L., "Logic per Track Devices," in Advances in Computers, Academic Press, 1970, pp. 291-296.
47. Su, S. Y. W., "Cellular-logic Devices: Concept and Applications," Computer, Vol. 12, No. 3, March 1979, pp. 11-25.
48. Su, S. Y. W., Copeland, G. P., and Lipovski, G. J., "Retrieval Operations and Data Representations in a Context-addressed Disc System," Proc. ACM SIGPLAN/SIGIR Interface Meeting, Nov. 1973, pp. 144-156.
49. Su, S. Y. W., Nguyen, L. H., Emam, A., and Lipovski, G. J., "The Architectural Features and Implementation Techniques of the Multicell CASSM," IEEE Transactions on Computers, Vol. C-28, No. 6, June 1979, pp. 430-445.
50. Su, S. Y. W., "Associative Programming in CASSM and its Applications," Proc. 3rd Int'l Conf. on Very Large Data Bases, Oct. 6-8, 1977, pp. 213-228.
51. Su, S. Y. W., Lupkiewicz, S., Lee, C. J., Lo, D. H., and Doty, K., "MICRONET: A Microcomputer Network System for Managing Distributed Relational Databases," Proc. 4th Int'l Conf. on Very Large Data Bases, Berlin, Germany, Sept. 13-15, 1978.
52. Tsichritzis, D. and Lochovsky, F., Data Base Management Systems, Academic Press, 1977.
53. Yamagishi, K., "The Progress of Magnetic Bubble Development in Japan," Proc. 3rd U.S.A.-Japan Computer Conference, October 1978.