Survey

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-11, NO. 10, OCTOBER 1985 1071

Statistical Database Query Languages

GULTEKIN OZSOYOGLU, MEMBER, IEEE, AND ZEHRA MERAL OZSOYOGLU

Abstract: Databases that are mainly used for statistical analysis are called statistical databases (SDB). A statistical database management system (SDBMS) may be defined as a database management system that provides capabilities 1) to model, store, and manipulate data in a manner suitable for the needs of SDB users, and 2) to apply statistical data analysis techniques that range from simple summary statistics to advanced procedures. This paper surveys the existing and proposed SDB data definition and data manipulation (i.e., query) languages.

Index Terms: Database systems, data definition, data manipulation, query languages, statistical databases.

I. INTRODUCTION

DATABASES that are mainly used for statistical analysis are called statistical databases (SDB). SDB application areas include health care, census data evaluation, economic planning, and management decision making, among others. A statistical database management system (SDBMS) may be defined as a database management system that provides capabilities 1) to model, store, and manipulate data in a manner suitable for the needs of SDB users, and 2) to apply statistical data analysis techniques that range from simple summary statistics to advanced procedures like discriminant or factor analysis. For simple summary statistics, the SDBMS is expected to have powerful, easy-to-use, and efficient data aggregation features. For more advanced statistical data analysis, on the other hand, the SDBMS provides an interface to statistical analysis procedures, which is either transparent to users or produces explicit output data ready to be input to statistical analysis procedures.

Most of the database management systems currently available in the commercial market are designed for business data processing environments. The primary goals of these so-called "corporate" database management systems (CDBMS) are to improve the productivity of application programmers and to facilitate easy data access by naive users [24]. However, CDBMS's are not widely used in SDB application areas, primarily because the conceptual and internal modeling tools and the query languages that they provide do not meet SDB users' needs. For example, data aggregation features of CDBMS's are add-on, ad hoc, and usually inefficient.

Traditionally the data management needs of SDB users have been met by the restricted data management capabilities of statistical packages and file management systems plus customized application programs. Another approach has been either to extend or modify the capabilities of an existing CDBMS to accommodate an SDB application or to build a new SDBMS.

What distinguishes an SDBMS from a statistical package? As more and more data management capabilities are introduced into statistical packages (such as a B+ tree organization in P-STAT [4] or new data manipulation commands of SPSS-X [14]), and as the software of statistical packages becomes more and more integrated, it is important to have some criteria to distinguish SDBMS's from statistical packages. We think that an emphasis on proper data management tools for SDB users (such as the availability of conceptual modeling tools, query languages, and rich internal (physical) modeling constructs), rather than an emphasis on advanced statistical analysis procedures, is a good criterion for distinguishing an SDBMS from a statistical package. Using this criterion, in this paper we distinguish existing and proposed SDBMS's in the literature, and examine their SDB query (i.e., data definition and data manipulation) languages.

In Section II, we list the criteria used to evaluate SDB query languages.
Section III gives a taxonomy of proposed and existing SDBMS's, and Sections IV and V examine SDB query languages according to the taxonomy given in Section III. Section VI discusses languages designed to manipulate summary tables, objects commonly used by SDB users. Section VII contains the concluding remarks.

Manuscript received February 15, 1985; revised June 5, 1985. This work was supported by the National Science Foundation under Grant MCS-8306616. Z. M. Ozsoyoglu was supported by an IBM Faculty Development Award. The authors are with the Department of Computer Engineering and Science, Case Institute of Technology, Case Western Reserve University, Cleveland, OH 44106. 0098-5589/85/1000-1071$01.00 © 1985 IEEE

II. EVALUATION CRITERIA FOR SDB QUERY LANGUAGES

Data modeling and manipulation capabilities of SDBMS's are developed according to the operational use of data by users. For example, during the exploratory data analysis phase [57] users deal with representative, interpreted, cleaned, or experimental subsets of data. The special utilization characteristics of SDB's necessitate incorporation of extensive metadata capabilities and new objects such as summary tables, matrices, and scatter diagrams [33], [37], [42], [46], [49], [50]. Therefore in our survey of SDB query languages, we evaluate (to the extent possible) specific data and metadata definition capabilities such as

* the objects definable by the language,
* data descriptors (units of measure, scale, missing values, data quality information, universe description),
* footnotes,
* keywords,
* textual description and historical data,
* editing specifications and data structuring capabilities,

and specific data manipulation capabilities such as

* aggregation capabilities,
* subsetting and sampling,
* metadata manipulation,
* handling time explicitly,
* historical data.
In all CDBMS's, aggregate functions such as SUM, MAX, and AVE are incorporated in an ad hoc manner into the associated query language. The relational algebra and the relational calculus query languages introduced by Codd [9] do not formally incorporate aggregate functions. Since aggregation operations are extremely frequent in statistical databases, query languages of most SDBMS's (such as STRAND [25], GENISYS [34], SSDB [43], and others) provide powerful and/or user-friendly aggregation capabilities.

Summary tables, tabular representations of summary (aggregated) data, are so important that almost all statistical packages provide some form of limited summary table output formatting capabilities. There are also some SDBMS's (such as HSDB [21] and STBE [44]) that provide summary table manipulation languages. Section VI surveys summary table manipulation languages.

Execution of an advanced statistical analysis procedure usually requires a long set of parameters to be initialized by a set of syntactically complex commands. There are three ways of interaction between an SDB query language and a statistical analysis procedure. One approach is to embed a specific statistical procedures library into an SDBMS and to develop syntactically simpler, easy-to-use capabilities in the query language for executing statistical analysis procedures (alternative one). This may be viewed as a master-slave approach in which the SDBMS dictates the execution of statistical analysis procedures, and the interface between the SDBMS and the library procedures is transparent to the user. This approach has been criticized as inflexible since it does not permit users to access different statistical analysis procedures in different packages. In another approach, users specify the execution of a procedure from a specific package in their query, the SDBMS prepares the input to that procedure, and the user later initiates the package procedure execution (alternative two).
This requires the SDBMS to be able to produce commands of a specific statistical package. In the third approach the SDBMS produces a flat table of the data needed for the execution of a statistical package, and the users are responsible for creating the package commands and for initiating the package execution (alternative three). We will comment on the type of statistical package interaction in SDB query languages.

It is important to define the expressive (manipulative) power of a language with respect to an object, since such a definition states unambiguously which manipulations are and are not achievable by the language. Therefore, whenever possible, we will specify the expressive power of an SDB language. We will also comment on the ease of use, syntax, and functionality of an SDB query language.

III. THE TAXONOMY OF SDBMS's

In this paper we examine SDB query languages of the following systems.

1) SDBMS's Built on Top of CDBMS's: The majority of the CDBMS's in this category are relational systems. Examples are HSDB (on Model 204 [6]) [21], Ghosh's extensions to SQL [15], System/K (on SQL/DS [22]) [32], and STRAND (on Ingres [51]) [25]. Another approach is to use a Generalized Interface System that links together available CDBMS's, statistical packages, and graphics software using a single high-level language. Examples are PASTE [60], SIBYL [18], and GPI [19].

2) Separately Developed SDBMS's: Below we further categorize these systems by the data model and query language they use.

a) Relational Data Model and Relational Query Languages: These systems use new internal (file) organization techniques and/or additional conceptual modeling tools and/or well-defined aggregation functions. Examples include RAPID [56] and CAS SDB [31], which use relational algebra; ABE [28], which uses relational calculus; SIR/SQL [1], GENISYS [34], and CANTOR [26], which use SQL [10]; and JANUS [27], which uses relation-like objects and relational algebra-like operators.
b) Network Data Model: An example is SIR/DBMS [1].

c) Formally Extended Relational Model and Relational Algebra/Calculus Languages: Examples are SSDL [3], Klug's work [29], [30], and extensions of Ozsoyoglu et al. [40], [43].

d) Graphical User Interfaces: Examples are SUBJECT [7], GUIDE [61], ABE [28], STBE [44], SEEDIS On-line Codebook [36], and ALDS Data Editor [54].

IV. QUERY LANGUAGES OF SDBMS's BUILT ON TOP OF CDBMS's

The query language STRAND [25] was developed as an ER-model [8] query language. STRAND expressions are translated into statements of QUEL, the query language of the relational system INGRES. STRAND is a derivative of CABLE [48], and it lacks data definition language (DDL) features. The main advantage of STRAND is that it allows aggregate queries to be formulated easily when the query involves a chain of entity sets (i.e., relations) in the ER-model. In such a case it is sufficient to specify the entity sets by marking the beginning and the end of a chain of, say, n entity sets. The system then performs the n-way join using the relationships between the entity sets in the chain (called the chaining operation). The only other operations are projection and restriction on entity sets (which are identical to the projection and restriction of the relational algebra) and summarization (aggregation) on iteratively aggregated entity sets (called summary sets). STRAND is not relationally complete (i.e., its expressive power for manipulating relations is less than that of relational algebra as defined by Codd [9]) since there are no set union, set difference, and set intersection operators; and it can only be used with tree-structured ER-models since the existence of more than one path between two entity sets creates query processing ambiguity. Also, there are no time and metadata handling capabilities in STRAND. Because relations are produced by STRAND, alternative three can be used as the interface to statistical packages.
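The chaining operation described above can be illustrated with a minimal Python sketch. It is not STRAND syntax or its QUEL translation; the entity sets (`hospitals`, `wards`, `patients`), their attributes, and the helper names `chain_join` and `summarize` are hypothetical, chosen only to show an n-way join along a chain followed by a summarization.

```python
from collections import defaultdict

def chain_join(chain, links):
    """Join a chain of entity sets pairwise, left to right.

    chain: list of entity sets, each a list of dicts (tuples).
    links: one (left_key, right_key) pair per adjacent pair in the chain.
    """
    result = [dict(row) for row in chain[0]]
    for rows, (left_key, right_key) in zip(chain[1:], links):
        # Index the next entity set on its linking attribute.
        index = defaultdict(list)
        for row in rows:
            index[row[right_key]].append(row)
        # Extend every partial result with each matching tuple.
        result = [
            {**left, **right}
            for left in result
            for right in index.get(left[left_key], [])
        ]
    return result

def summarize(rows, by, attr, func):
    """Group joined tuples on one attribute and aggregate another."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row[attr])
    return {key: func(values) for key, values in groups.items()}

# Hypothetical tree-structured chain: HOSPITAL -> WARD -> PATIENT.
hospitals = [{"hosp_id": 1, "hosp_name": "General"},
             {"hosp_id": 2, "hosp_name": "Mercy"}]
wards = [{"ward_id": 10, "hosp_id": 1},
         {"ward_id": 11, "hosp_id": 1},
         {"ward_id": 20, "hosp_id": 2}]
patients = [{"pat_id": 100, "ward_id": 10, "age": 30},
            {"pat_id": 101, "ward_id": 11, "age": 50},
            {"pat_id": 102, "ward_id": 20, "age": 70}]

joined = chain_join([hospitals, wards, patients],
                    [("hosp_id", "hosp_id"), ("ward_id", "ward_id")])
mean_age = summarize(joined, "hosp_name", "age", lambda v: sum(v) / len(v))
print(mean_age)  # {'General': 40.0, 'Mercy': 70.0}
```

Because the user names only the two ends of the chain, the join path must be unique, which is one way to read the restriction to tree-structured ER-models noted above.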
HSDB [21] is an SDBMS implemented on top of the relational system Model 204. HSDB has extensive data descriptors such as the discrimination between discrete/continuous values (or the original source), missing values, the unit, the precision, the theoretical distribution, and summary statistics about the set (or bag) of values in a column of a relation. In addition, HSDB retains metadata about derived data: when it was derived, who derived it, and the formula used for the derivation. Also, single-column relations can be created by specifying the tuple component values in various ways. In addition to relations, HSDB maintains summary tables and provides a limited set of summary operations (see Section VI). For statistical analysis, alternative one is chosen, i.e., the query language contains a set of statistical analysis procedures as operations on relations or summary tables. For security, access control commands with limited power are provided for each relation and summary table.

Ghosh [15] extends Codd's relational model with a new object, called the statistical relational table (SRT) (which is identical to a primitive summary table; see Section VI), and proposes a set of extensions to the SQL language to create an SRT from relations, to select a smaller SRT from a given SRT, to further aggregate the information in an SRT, to implement statistical sampling techniques such as stratified sampling or systematic sampling, and to implement statistical data analysis procedures such as time series analysis or curve fitting (i.e., alternative one to statistical data analysis is proposed).

System/K [32] is an "object-oriented" knowledge base management system built on top of SQL/DS. Although System/K lacks the majority of SDBMS characteristics, it has extensive metadata management capabilities and an interface to user-specific languages (i.e., a User Specialty Language Interface).
SIBYL [18], PASTE [60], and GPI [19] are examples of systems that use a CDBMS, statistical packages, and graphics software available off the shelf to create a software system for managing data. In addition to the query language of the underlying CDBMS, these systems provide data restructuring capabilities and commands to browse, update, and extract data. SIBYL is a system that manages time series data. It uses the relational system Model 204 as the CDBMS, a database template (for mapping the logical structures of a time series database into the Model 204 structures), a set of procedures for browsing, updating, and extracting data, and statistical packages. GPI has "customizer" software for tailoring a statistical package to access a specific data/file structure, and a "dictionary" for describing data/file structures. The general approach in PASTE is to let the users 1) write their application programs using the commands of statistical packages and/or the query languages of a CDBMS, and 2) for each data transformation between different systems (where a system may be a statistical package, a CDBMS, or a graphics package), produce PASTE commands to handle the transformation.

V. QUERY LANGUAGES OF SEPARATELY DEVELOPED SDBMS's

A. Relational Model-Based SDB Query Languages

Systems in this category use relations as data modeling tools and an algebra- or calculus-based language.

RAPID [17], [56] is a relational system developed by Statistics Canada and widely used by statistical agencies in several countries. It uses relational algebra to process user queries. The main characteristics of RAPID are its very efficient execution of statistical queries (using transposed files [2]) and its efficient storage utilization (using data compression by encoding). Each RAPID relation is a self-describing transposed file. It is self-describing in the sense that data and metadata about a relation (such as attribute names, data types, size, domain, last update date, status, etc.)
are stored together in the same file. Additional metadata is maintained in the RAPID dictionary (which is a single relation) in terms of entities and items of entities, and is accessed using a special retrieval operator. An entity may be a relation, a codeset (which describes the codes used for encoding), a value set (a special codeset for qualitative relation columns), or a comment. Items describe information about entities (e.g., relation items describe columns, and codeset items define the codes). RAPID is relationally complete and has an interface (alternative three) to statistical packages (e.g., SAS, SPSS) and to summary table producing systems (e.g., TPL).

GENISYS [12], [34], [35] is an SDBMS that provides a relation-like view of the data in the database, where a relation (an entity) corresponds to a file. Since some relation columns are allowed to contain repeating fields or ranges of values, relations may be considered to be in non-first normal form [62]. The query language GQL of GENISYS is a high-level SQL-like language. For computing functions, GQL has facilities for users to specify macro-like program fragments and to expand the fragments referenced in a query automatically into high-level program code. GQL uses links predefined by the DBA in a novel way to specify and execute joins among relations efficiently. These defined links are similar to those in the Link and Selector language [55]. The use of links for joins, however, means that two relations cannot be joined if there is no path of links between them. And if the only path between two relations involves n relations, an (n - 1)-way join is required to join the two relations.

GQL allows users to specify aggregation of a population of values by first grouping (not partitioning) the values using range specifiers, and then applying aggregation to each group (similar to the aggregation-by-template operator of SSDB).
For example, consider the relation scheme PERSON (ID, SEX, BIRTHYEAR, BIRTHCOUNTY, DEATHYEAR). The GQL query

    SELECT AVERAGE (DEATHYEAR-BIRTHYEAR)
    BY SEX
    BY BIRTHYEAR (1900 ... 1954,5)
    BY BIRTHCOUNTY ('UTAH', 'SALT LAKE', '*')

groups persons from the "main logical file" PERSON by SEX, by BIRTHYEAR ranges of size 5 from 1900 to 1954, and by BIRTHCOUNTY values of UTAH, SALT LAKE, and * ("*" denotes "all counties"), and computes the average age at death for each group. Each BY clause is a range specifier.

GQL has operations to create synonyms and comments, and to display information on the coding schemes used (GENISYS uses encoding for data compression) and on the links in the database. There are operations to append header information to relations and to attach attribute names to columns of output relations. Undefined values are represented as null values. GQL has consistency constraint checking mechanisms (as a rules system [52]) and abstract data types for attributes which consist of other attributes. There are also facilities for users to browse through the data dictionary.

ABE [28] is a screen-oriented language similar to Query-by-Example (QBE) [64]. The main feature of ABE is its use of subqueries with parameters to express aggregations, instead of grouping operations (as in SQL). In SQL, grouping operations automatically eliminate empty partitions from the output; therefore, after aggregation is applied to the set of partitions, the result does not contain any information about empty partitions (whereas in ABE empty partitions are retained). This is called the empty partition problem. Moreover, some nested aggregations expressible by a single ABE query cannot be expressed in SQL [28]. Therefore, in addition to being syntactically simpler and user-friendly, ABE is more powerful than both SQL and QUEL as far as aggregation queries are concerned. ABE can express conjunctive relational queries (it is related to the relational calculus with aggregate functions [29]).
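The empty partition problem described above can be reproduced with Python's standard `sqlite3` module. This is only an illustrative sketch: the tables, the rows, and the DAVIS county value are hypothetical, and the second query is a present-day SQL correlated subquery standing in for the idea of a parameterized subquery, not ABE itself or the SQL dialect of the period.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE county (name TEXT PRIMARY KEY);
    CREATE TABLE person (id INTEGER PRIMARY KEY, birthcounty TEXT, sex TEXT);
    -- Hypothetical data: nobody in PERSON was born in DAVIS.
    INSERT INTO county VALUES ('UTAH'), ('SALT LAKE'), ('DAVIS');
    INSERT INTO person VALUES (1, 'UTAH', 'F'), (2, 'UTAH', 'M'),
                              (3, 'SALT LAKE', 'F');
""")

# Grouping forms partitions only from tuples that exist, so the empty
# DAVIS partition never appears in the output.
grouped = conn.execute(
    "SELECT birthcounty, COUNT(*) FROM person "
    "GROUP BY birthcounty ORDER BY birthcounty"
).fetchall()

# A subquery evaluated once per county value (the parameter) keeps the
# empty partition, reporting a count of 0 for it.
per_county = conn.execute(
    "SELECT c.name, "
    "       (SELECT COUNT(*) FROM person p WHERE p.birthcounty = c.name) "
    "FROM county c ORDER BY c.name"
).fetchall()

print(grouped)     # [('SALT LAKE', 1), ('UTAH', 2)]
print(per_county)  # [('DAVIS', 0), ('SALT LAKE', 1), ('UTAH', 2)]
```

The difference lies in what drives the iteration: the grouped query is driven by the tuples present in PERSON, while the second is driven by the domain of category values.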
However, it cannot express set union, and therefore is not relationally complete. Moreover, there are some other simple queries that are not expressible in ABE due to the limitations of the available query formulation constructs [45]. In ABE, queries that involve all, only, or no qualifiers (i.e., existential and universal quantifications in predicate calculus) are handled by set comparison operators (i.e., set equality and set containment). This approach is simpler and more user-friendly than the semi-explicit use of quantifiers in QBE.

The system CANTOR [26], designed for the analysis of large sets of data, uses an object-based data model and a data manipulation language, SAL, based on an algebra of relations. Objects are either elementary objects (e.g., integers, literals, or text), tuples, or set objects. A relation is a special set object, namely a set of tuples of the same type. The metadata maintained includes information about stored data, and consists of three relations. Each object may have the mode value (i.e., a stored object) or view (i.e., an expression that, when evaluated, forms a value). SAL queries are nonrecursive algebraic expressions in which operators are functions from operand values to result values. If the result value cannot be computed then it is "undefined" of the appropriate type. The unary and binary operators include arithmetic operators (e.g., +), arithmetic comparison operators (e.g., <), arithmetic functions (e.g., SQR), and logical operators (e.g., AND) that are valid for the proper object type. Binary set algebra operators include set equality, set containment, set union, etc. All basic relational algebra operators such as restriction, generalized projection, selection (with variations like SELECTMIN and SELECTMAX), and Cartesian product are available. A set of aggregation operators such as COMPUTE, SUM, and PRODUCT is also provided.
There is a partitioning operator that partitions a relation into a set of relations and applies aggregation to each relation (similar to the SELECT-GROUP BY feature of SQL). Subsystems for statistical analysis remain to be designed and implemented.

Another experimental system is CAS SDB [31], which has extensive metadata management facilities and an interface to the statistical package SAS, and uses a subset of relational algebra as its query language.

JANUS is an SDBMS used within a large-scale data analysis and modeling system called CONSISTENT [27]. JANUS utilizes relations with set-valued attributes and null values, and a set of relational algebra-like operators. It has operations that are approximately equivalent to set theory operations and directional join (or outer join) [38]. There are also capabilities to attach information to a relation (e.g., the mean or standard deviation of the values in a column of the relation). JANUS uses the concept of links (called mappings), similar to the links of GENISYS, to specify relationships between different relations. It then uses these links to infer information among relations. JANUS has an interface to the rest of the CONSISTENT system for statistical analysis (using alternative two or alternative three).

The SIR/DBMS system [1], currently being developed, provides a relational view of data on which one can superimpose any hierarchical or network views. The SIR software provides a relational query system, SIR/SQL+, which allows the user to deal only with relations. Judging from the examples in [1], SIR/SQL+ has the same expressive power for manipulating relations as SQL. In addition, SIR/DBMS has facilities for 1) naming, labeling, and documenting the data in the database, 2) data quality control
(e.g., range and consistency checking, and special handling of missing and undefined data), 3) I/O security controls, 4) a set of simple statistical procedures that include frequency distributions and histograms, descriptive statistics, scattergrams, line printer plots, and simple linear regressions, 5) summary data tabulation features (see Section VI), and 6) an interface to the statistical packages BMDP, SPSS, and SAS by creating the input data file to these packages (i.e., alternative three).

B. Network or Hierarchical Model-Based SDB Query Languages

Compared to the relational model, there seem to be very few SDBMS's that use network or hierarchical data models. One such proposed system is SIR/DBMS, which has a procedural retrieval language that enables users to navigate the network (or hierarchical) database. The details of the navigational query language, however, are not clear [1].

The Table Producing Language (TPL) Descriptive Codebook system (TPLDC) [59] uses a rather unconventional way of utilizing the relationships between entity sets (where each entity set is a file). First, the database administrator forms a directed graph G of relationships between entity sets. A one-to-many relationship between entity sets A and B is represented by a directed edge from A to B. A one-to-one relationship is also represented by only one directed edge (which is an implementation decision [58]). A many-to-many relationship between entity sets A and B is represented by two directed edges, one from A to B and another from B to A. All possible rooted directed trees in G in which no node has more than one "one-to-many" edge to its children (Rule R) are enumerated into a set S, and all trees which are subtrees of a tree in S are deleted from S to form the set V. The trees in the set V are specified by association statements, and the set V becomes a permanent feature of the TPLDC system.
When a user wants to produce a summary table (see Section VI) from the database, he chooses a tree (called a view in TPL) from the set V using the use command; selects a subset of the entities for summarization in the summary table using logical conditions that involve arithmetic operators, comparison operators, and logical operators; and then specifies the attributes (called variables) of entities to be extracted and aggregated for the summary table. The restriction introduced by Rule R is not needed for unambiguous query specification; rather, it is an implementation restriction.

There are two problems with this approach. First, the set V may be extremely large. Consider a directed clique G (i.e., a directed graph where for any two nodes A and B, there is an edge from A to B) with n nodes, in which each edge represents a "one-to-many" relationship. Then V contains n! trees. Even when V is small there may be several trees for users to choose from. Second, if a user's query accesses only a few entity sets, he still has to deal with possibly large trees. However, most databases that use the TPLDC system typically have a very small number of entity sets (e.g., n < 5) [58], and TPL is used in over 200 computer centers around the world [59].

C. SDB Query Languages that Utilize Formal Extensions to the Relational Model

Systems in this category include SSDL [3], Klug's work on relational algebra and calculus, and the algebra and calculus extensions of Ozsoyoglu et al. [40], [43].

SSDL is a high-level procedural data manipulation language that manipulates objects of type set, ordered set, vector, matrix, time, time series, text, and G-relation (referred to as complex data types). All of the complex data types except G-relation are self-explanatory. G-relation (i.e., generalized relation) is an object type that is used to represent a data model called the Semantic Association Data Model (SAM*) [53].
SAM* models the real world in terms of a set of interrelated associations: membership, aggregation, generalization, interaction, composition, cross-product, and summarization. These associations are represented by one or more G-relations. A G-relation is a relation (i.e., a set of tuples) with each column (attribute) of the relation drawing its values from a complex domain. A complex domain may be of any complex data type, including the G-relation itself. Therefore a G-relation is not in first normal form [62], since tuple components do not always have elementary-valued (atomic) data types such as integers or reals. A G-relation is also different from the non-first normal form relations of [23] (which allow only elementary values and arbitrarily nested sets of elementary values as tuple components) in the sense that it allows objects of various data types (e.g., matrix or ordered set) as tuple components. Since the G-relation is recursively defined, an arbitrary number of G-relations may be nested inside a single tuple component of a G-relation. This allows for the construction of arbitrarily complex G-relations. However, internal (file) organization techniques for G-relations are yet to be investigated. Attributes of a G-relation may be distinguished as category (i.e., identifying) or summary attributes. Category attributes of a tuple qualify the summary attributes of the tuple, which contain the measurements and needed values.

Operators of SSDL include the usual relational algebra operators for G-relations, set theory operators for sets, linear algebra operators for vectors and matrices, and string manipulation operators for text, as well as some related set theory and linear algebra operators for ordered sets and time series.
Since SSDL is designed to be highly procedural, it contains explicit constructs to scan tuples of a relation and to perform manipulations and conditional evaluations (similar to the for construct of Pascal), the notion of a currency pointer to retain scanning positions for nested scans, and blocking constructs such as BEGIN-END and DO-END. As a result, an SSDL query resembles high-level programming language code. This approach deviates significantly from the notion of providing database users with a minimal number of operators for the sake of simplicity while maintaining the expressive power of the language at a certain level. There is also some overlap (for first normal form relations) between the capabilities provided by the relational algebra operators and those provided by the scanning and blocking constructs.

Klug [29] extends relational algebra and relational calculus by incorporating aggregate functions, and shows that the extended languages have the same expressive power. Klug's extension of the relational algebra (as defined by Codd [9]) is a new aggregation operator, called the aggregate formation, that partitions a relation (or an algebra expression evaluating to a relation) on a set of attributes X, applies an aggregate function, say SUM, to each partition, and outputs the X-value and the associated SUM-value for each partition. As an example, consider the relation HOUSES-SOLD (HOUSE#, STATE, COUNTY, HOUSE-PRICE). The aggregate formation operation HOUSES-SOLD <STATE, SUM (HOUSE-PRICE)> returns a two-column relation in which each tuple contains a state name and the sum of the prices of the houses sold in that state. Relational calculus as originally introduced by Codd uses an alpha expression that consists of a target list and a formula.
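The aggregate formation operator on HOUSES-SOLD described above can be sketched in a few lines of Python. The function name `aggregate_formation` and the sample tuples are hypothetical; the sketch only mirrors the verbal definition (partition on X, aggregate each partition, output the X-value with the aggregate value) and makes no claim about Klug's formal notation.

```python
from collections import defaultdict

def aggregate_formation(relation, x_attrs, agg, agg_attr):
    """Partition `relation` on `x_attrs` and aggregate `agg_attr` per partition.

    Returns a relation (a set of tuples): the X-value followed by the
    aggregate value of its partition.
    """
    partitions = defaultdict(list)
    for t in relation:
        key = tuple(t[a] for a in x_attrs)
        partitions[key].append(t[agg_attr])  # each tuple joins exactly one partition
    return {key + (agg(values),) for key, values in partitions.items()}

# Hypothetical instance of HOUSES-SOLD (HOUSE#, STATE, COUNTY, HOUSE-PRICE).
houses_sold = [
    {"HOUSE#": 1, "STATE": "OH", "COUNTY": "CUYAHOGA", "HOUSE-PRICE": 80000},
    {"HOUSE#": 2, "STATE": "OH", "COUNTY": "LAKE", "HOUSE-PRICE": 60000},
    {"HOUSE#": 3, "STATE": "UT", "COUNTY": "UTAH", "HOUSE-PRICE": 70000},
]

# Counterpart of HOUSES-SOLD <STATE, SUM (HOUSE-PRICE)>.
by_state = aggregate_formation(houses_sold, ["STATE"], sum, "HOUSE-PRICE")
print(sorted(by_state))  # [('OH', 140000), ('UT', 70000)]
```

Since the partitions are built from the tuples themselves, a state with no sold houses produces no output tuple.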
Klug [29], in order to define ranges for variables dynamically, extends the relational calculus by replacing the formula with range formulas and a qualifier, where a range formula itself is allowed to be a (closed) alpha expression.

Ozsoyoglu et al. [40], [43] define extended relational algebra and calculus languages (mainly for manipulating summary tables; see Section VI) that utilize aggregate functions and relations with set-valued attributes (common in statistical databases), and show that the algebra and calculus languages so extended are equivalent in expressive power [40]. The extended algebra uses an aggregation-by-template operator, and pack and unpack operations for set-valued attributes. The aggregation-by-template operator groups tuples of a relation into (not necessarily disjoint) groups using a template relation, applies an aggregate function to each group, and returns the template value of the group and the associated aggregate value. The aggregation-by-template is more convenient than the aggregate formation when there are prespecified groupings of attributes for aggregation (common in statistical databases). Also, the aggregation-by-template is based on grouping tuples (i.e., a tuple may belong to more than one group), while the aggregate formation is based on partitioning tuples. However, each aggregation operator is expressible by an algebra expression utilizing the other aggregation operator.

The extended relational calculus of [40] forms the basis of a user-friendly language called Summary-Table-by-Example (STBE) [45], which is the query language of an SDBMS called SSDB [46]. STBE manipulates summary tables and relations with set-valued attributes (set-valued relations), and is similar to the QBE and ABE query languages. In STBE, the user constructs an example query on the screen by filling in skeletons (i.e., graphical schemes) of relations and summary tables in hierarchically arranged windows (i.e., subqueries).
The hierarchical structure of subqueries in STBE (also in ABE) provides a natural way to specify aggregate functions, and also solves the empty partition problem. For query processing, STBE queries are converted into an extended relational algebra expression (of [43]), and transformed into semantically equivalent expressions which are more efficient to execute by conventional techniques [41].

D. SDB User Interfaces

The difficulties encountered by noncomputer science professionals in using database query languages led database researchers to a variety of user-friendly user interfaces. The reasons for these difficulties are [61]: 1) the requirement on the part of the user to remember too many details such as the meanings of acronyms used for entity and relation types, and their attributes, 2) inadequate semantics of data models that are usually based on abstract mathematical concepts (e.g., symbolic logic theory or set theory), 3) lack of a facility to formulate queries in a piecemeal fashion (especially important in statistical databases due to the exploratory nature of the data analysis), 4) lack of levels of detail in database schemes, and 5) lack of a metadata browsing facility.

The user interfaces proposed or implemented by database researchers and practitioners range from powerful graphical query languages with a well-understood expressive power (such as QBE) to natural language-based query languages (such as [20]) and menu-driven, browsing-based systems (such as E-R Interface [5]). For SDB user interfaces, we are not aware of any natural language-based query languages. However, there are graphical SDB query languages with a formal expressive power such as STBE and ABE, and menu-driven, browsing-based user interfaces with extensive facilities, such as GUIDE, SUBJECT, SEEDIS On-line Codebook, and ALDS Data Editor. Since STBE and ABE are discussed before and in Section VI, below we summarize the menu-driven SDB user interfaces.
GUIDE [61] uses the E-R Model to represent entities and relationships explicitly on the screen as a network of objects. Parts of the schema can be removed from the screen by the system automatically (i.e., multiple levels of details) or by the user. To aid the user in exploring the metadata, there are two kinds of directories. The hierarchical subject directory organizes the entity types in the database into logical groups hierarchically. The user is guided by the system through this directory to locate the part of the schema he wishes to see on the screen. The hierarchical attribute directory organizes attribute types into hierarchical groups. Both directories are implemented as menus. There is a facility to order and classify entity and relationship types into groups according to their relevancy to a particular group of users. There are also commands to move the displayed schema around the screen, to zoom in and zoom out on selected parts, etc. GUIDE queries are expressed as a traversal along the network of entities on the screen. A GUIDE query is a path selected by the user. Users can then formulate local queries in different colors, see their results, and then link those local queries into more complex local queries.

SUBJECT system [7] has two basic types of abstractions to represent the data and metadata of SDB's using a directed acyclic graph (i.e., a hierarchy), called the SUBJECT graph. The cluster abstraction of the SUBJECT graph represents the set membership relationship according to a common property, or the clustering of entities according to a common property. For example, entities "male" and "female" are clustered into a set "sex." Or "white," "black," "hispanic," etc., are clustered into a set "race." The cross product abstraction utilizes category attributes and summary attributes. Entities in statistical databases are commonly partitioned into groups using descriptive (category) attribute values, and a quantitative (summary) attribute is aggregated to obtain a single summary value (for a new summary attribute). The cross product abstraction represents the cross product of an n-dimensional space where each dimension corresponds to a category attribute, and each combination of category attribute values corresponds to a single summary value. SUBJECT system provides an interactive facility for specifying the SUBJECT graph, a browsing facility, and a document command to attach textual information to each node in the graph. SUBJECT queries are specified during browsing using menu techniques and a small set of commands. The user moves around (browses) the SUBJECT graph, and includes nodes into the set of query conditions. The conditions are then anded (i.e., only conjunctive queries are allowed) and the output is displayed. Different semantics are attached to cross product and cluster nodes so that, if they are selected for the query, automatic aggregation consistent with natural language expressions involving summary data is performed by the SUBJECT system.

SEEDIS [36] is a distributed system for the retrieval, analysis and display of geographically linked data. For identifying the data to be retrieved (i.e., data selection), a SEEDIS user defines a geographic scope and level, and selects the desired data items from an on-line data dictionary using a browsing facility. Then the extract command retrieves the selected data items.

The ALDS data editor [54] of the ALDS system is an experimental data editor and a subset generator. It has a set of commands to specify subsets of data, and uses a graphical representation of data analysis environments (using a directed acyclic graph representation on the screen) with various features such as defining views (called virtual subsets) and attaching conditions or environmental parameters to a data manipulation operation. ALDS has an interface to the statistical package MINITAB (alternative three).

VI. SUMMARY TABLE MANIPULATION LANGUAGES

One of the basic functions of SDBMS's is to create and manipulate summary data from the raw or summary data in the database. Tabular representations of summary data (summary tables) are widely used in various SDB application areas. Summary tables are not used only for output formatting; they are maintained (mostly manually at present) for bookkeeping, compared, and evaluated, usually over a time span. Therefore, it is proper to consider summary tables as logical SDB data modeling tools, and provide query languages for defining and manipulating summary tables. SDBMS's with varying ranges of summary table creation and/or manipulation capabilities include STRAND, SUBJECT, HSDB, TPL, STL [39], and STBE.

Fig. 1(a) shows an instance of the summary table 1985-DEATH-COUNTS-IN-CUYAHOGA-COUNTY. A summary table consists of a two-dimensional table of summary (cell) attribute values, and category attribute values, in rows and columns of the table, that qualify the cell attribute values. Category attribute values are structured as row and column forests of category attribute trees whose nodes are attribute values. Cell attributes are always simple- (elementary-) valued; category attributes may be simple-valued or set-valued. Fig. 1(b) shows the summary table scheme for the summary table in Fig. 1(a). Attributes COUNT1 and COUNT2 are cell attributes; attribute *DEATH-AGE is a set-valued category attribute (indicated by the prefix '*'); and attributes SEX, DEATH-CAUSE, and RACE are simple-valued category attributes. A primitive summary table is a summary table with exactly one cell attribute and the associated category attributes. A summary table in general is represented as a set of primitive summary tables. Fig. 1(c) contains the primitive summary table instances 1985-DEATH-COUNTS-BY-SEX-DEATHCAUSE and 1985-DEATH-COUNTS-BY-SEX-RACE of the summary table scheme 1985-DEATH-COUNTS-IN-CUYAHOGA-COUNTY in Fig. 1(b). Sato [47] gives the theoretical foundations for derivability of primitive summary tables from other primitive summary tables and/or atomic data.

A relation possibly with set-valued attributes can be used to represent a primitive summary table excluding the order and the type (i.e., row or column) of category attributes. Fig. 1(d) contains the relation instances DEATHCOUNTS1 and DEATHCOUNTS2 that represent primitive summary tables 1985-DEATH-COUNTS-BY-SEX-DEATHCAUSE and 1985-DEATH-COUNTS-BY-SEX-RACE, respectively, of Fig. 1(c). Notice that both relations are non-first-normal-form relations due to the set-valued attribute *DEATH-AGE. A relation instance that represents a primitive summary table (and that has no null cell attribute values, where null stands for nonexistent) is said to be information equivalent [43] to that primitive summary table.

Fig. 1. (a) Summary table instance. (b) Summary table scheme. (c) Instances of primitive summary tables. (d) Relations DEATHCOUNTS1 and DEATHCOUNTS2.

The STRAND query language has the capability to create a relation that represents several primitive summary tables having category attributes at once, using a single STRAND operation, called summarization. However, in providing such a powerful operator, STRAND creates an inflexibility and a user inconvenience in that the database administrator must a priori define the procedures (in the schema) for obtaining values of each cell attribute in each primitive summary table. STRAND does not have any other explicit summary table operations.

A cross product abstraction instance in the SUBJECT system may be viewed as a primitive summary table. The aggregation command of the SUBJECT system allows users to obtain new primitive summary tables from other summary tables (when the aggregation is applied to a cross product node in the SUBJECT graph) and from atomic data (when the aggregation is applied to a cluster node in the SUBJECT graph). SUBJECT does not have any other summary table operations.

HSDB system has capabilities to create and manipulate primitive summary tables (called elementary summary tables). It has an operation that creates a primitive summary table from a relation. Primitive summary table operations that operate on a primitive summary table and produce a primitive table are projection and reclassification. Projection eliminates one category attribute from a primitive summary table by proper aggregation of cell attribute values. Reclassification merges values of a category attribute into larger disjoint groups and, for each group, computes a new cell attribute value by the proper aggregation operator.
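The HSDB projection and reclassification operations just described can be sketched in Python, assuming a primitive summary table is held as a dictionary from category-value tuples to a cell value. The function names, the table representation, and the counts are hypothetical and chosen only for illustration; in this sketch the aggregation function is passed explicitly as a parameter.

```python
from collections import defaultdict


def project(table, attrs, drop, agg_fn=sum):
    """Eliminate category attribute `drop`, aggregating the cells that collapse together."""
    i = attrs.index(drop)
    groups = defaultdict(list)
    for key, cell in table.items():
        groups[key[:i] + key[i + 1:]].append(cell)
    return attrs[:i] + attrs[i + 1:], {k: agg_fn(v) for k, v in groups.items()}


def reclassify(table, attrs, attr, mapping, agg_fn=sum):
    """Merge values of `attr` into larger disjoint groups (old value -> new group)."""
    i = attrs.index(attr)
    groups = defaultdict(list)
    for key, cell in table.items():
        groups[key[:i] + (mapping[key[i]],) + key[i + 1:]].append(cell)
    return attrs, {k: agg_fn(v) for k, v in groups.items()}


# Hypothetical primitive summary table of counts by SEX and RACE.
ATTRS = ["SEX", "RACE"]
COUNTS = {
    ("F", "White"): 10, ("F", "Black"): 20, ("F", "Other"): 5,
    ("M", "White"): 30, ("M", "Black"): 15, ("M", "Other"): 7,
}

projected_attrs, projected = project(COUNTS, ATTRS, "RACE")
_, reclassified = reclassify(
    COUNTS, ATTRS, "RACE",
    {"White": "White", "Black": "Non-white", "Other": "Non-white"},
)
print(projected_attrs, projected)
print(reclassified)
```

Combining cells this way is sound for counts and sums because those functions can be recomputed from partial results; a function such as the median cannot, in general, be recovered from the medians of the merged cells alone.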
Both of these operations utilize "hidden information" (e.g., they do not explicitly specify the aggregation function which is stored in the schema and used to obtain the new cell attribute values) and have restrictions in their usage (e.g., projection does not work if the aggregation function used to obtain the original primitive summary table is MEDIAN). The only other summary table operation in HSDB is the concatenation operator which allows users to concatenate primitive summary tables to obtain nonprimitive summary tables whose category attribute trees are simple chains.

Ghosh [15] proposes two languages that manipulate a single primitive summary table (called statistical relational table) in which set-valued category attribute values are identified by a single value (e.g., the *DEATH-AGE value {10, ..., 20} is represented by 15). The first language is a relational algebra-based language with commands project to eliminate rows or columns from a primitive summary table and aggregate to remove rows or columns from a primitive summary table by aggregation (identical to the Attribute-Removal-by-Aggregation operation of STL [39]). There are also data sampling (and statistical analysis) commands that use a given primitive summary table cell instance as a raw data (i.e., microdata) population specification from which sampling is done. The second language, Query by Statistical Relational Table (QBSRT), is a two-dimensional graphical language similar to QBE, and has the same manipulative power as the first algebraic language. However, QBSRT, unlike QBE, produces a primitive summary table instance (rather than a scheme) on the terminal screen to specify a query, and therefore may provide too many details to users (see Section V-D), reducing its user friendliness.

The TPL [63] system contains nine different commands to produce arbitrarily complex summary tables from tree-structured files (see Section V-B).
Two of these commands, use and select, are already discussed in Section V-B. The table command specifies the row and column forests of category attribute trees 1) by utilizing the ordering among the category attribute trees of the forest, and 2) by essentially specifying the preorder enumeration of each category attribute tree (i.e., first the root then the subtrees from left to right) using a rather complex syntax. Users can describe a new category or cell attribute from existing attributes (and their values) using define or compute commands. A new cell attribute and its associated category attributes can be defined from the computed cell attributes using the post-compute command. For time series data, the relative time command relieves the users of the burden of continuously changing the values of the category attribute DATE. Finally, median and quantile commands allow median and quantile (cell attribute) values to be produced. TPL has very powerful summary table creation facilities from data file(s). However, it is executed as a stand-alone system in batch mode, and lacks commands that operate on previously produced summary tables (i.e., it does not manipulate summary tables).

STBE has the power to create and manipulate arbitrarily complex summary tables. For summary table creation in a query, the user specifies the summary table scheme in the output section of the query, using a parenthesized expression. The corresponding summary table skeleton is graphically displayed by STBE. The user then proceeds to specify the example query which may extract information from summary tables and/or relations. Whenever the STBE query has a reference to a primitive summary table of a certain summary table, that primitive summary table is extracted and converted into the corresponding (set-valued) relation.
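The conversion of a primitive summary table into the corresponding set-valued relation can be sketched as follows. This is only an illustration under assumed data structures: the dictionary layout of the primitive summary table, the function name, and the counts are hypothetical, and the sketch says nothing about how STBE itself stores or converts tables. It shows the row/column role and the order of category attributes being dropped, with a set-valued attribute carried as a frozenset.

```python
def to_relation(primitive_table):
    """Flatten a primitive summary table into a list of relation tuples (dicts)."""
    cell_attr = primitive_table["cell_attribute"]
    relation = []
    for (row_vals, col_vals), cell in primitive_table["cells"].items():
        # Row and column category attributes become ordinary attributes;
        # whether a value came from a row or a column is no longer recorded.
        tup = dict(zip(primitive_table["row_attributes"], row_vals))
        tup.update(zip(primitive_table["column_attributes"], col_vals))
        tup[cell_attr] = cell
        relation.append(tup)
    return relation


YOUNG = frozenset(range(0, 40))    # set-valued *DEATH-AGE value {0, ..., 39}
OLD = frozenset(range(40, 100))    # set-valued *DEATH-AGE value {40, ..., 99}

# Hypothetical primitive summary table (counts are illustrative only).
PRIMITIVE = {
    "row_attributes": ["SEX", "DEATH-CAUSE"],
    "column_attributes": ["*DEATH-AGE"],
    "cell_attribute": "COUNT",
    "cells": {
        (("F", "Cancer"), (YOUNG,)): 3,
        (("F", "Cancer"), (OLD,)): 40,
        (("M", "Heart Failure"), (YOUNG,)): 5,
        (("M", "Heart Failure"), (OLD,)): 27,
    },
}

relation = to_relation(PRIMITIVE)
for tup in relation:
    print(tup["SEX"], tup["DEATH-CAUSE"], min(tup["*DEATH-AGE"]), max(tup["*DEATH-AGE"]), tup["COUNT"])
```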
Therefore, after this conversion, as far as the query processing is concerned, an STBE query may be regarded as manipulating only relations, and producing a set of relations (if the output is a summary table) or a single relation. This approach of converting the references to a summary table into references to a set of relations leads to an integrated query language with a well-understood expressive power, and an efficient query processing technique [41].

The Summary Table Language (STL) [39] contains a set of summary table manipulation operators that, together with the relational algebra of set-valued relations (as defined in [43]), form an algebraic language for creating and manipulating set-valued relations and arbitrary summary tables. STL, the algebraic counterpart of STBE, has six basic operations. Relation Formation (REL) and Primitive Summary Table Formation (ST) operations provide the conversions of a primitive summary table to/from the corresponding relation. Concatenate (CONC) operation concatenates two summary tables that have the same row or column forests of category attribute trees. Extract (EX) operation, the inverse of concatenate, extracts a summary table whose row and column forests each contains a single category attribute tree that belongs to the original input summary table. Attribute Split (SPLIT) and Attribute Merge (MERGE) operations provide primitive/nonprimitive summary table transformation capabilities by relocating the rows/columns of a summary table. Similarly, relation formation and primitive summary table formation provide relation/primitive summary table transformation capabilities. Therefore a nonprimitive summary table can be transformed into a set of (perhaps set-valued) relations and manipulated using the extended relational algebra operators [43]. Fig. 2 describes the objects of STL and the associated operations.

Fig. 2. Summary table language (STL) basic operations.

Although the basic operations of STL are powerful enough to manipulate arbitrary summary tables, expressions for some common summary table manipulation queries become quite long. Therefore STL has additional operations (expressible by basic operations) that simplify common expressions significantly. These operations include Aggregation-over-Table, Attribute-Removal-by-Aggregation, and operations for summary table formation from several primitive summary tables and decomposing a summary table into its primitive summary tables.

VII. CONCLUDING REMARKS

In this paper we give a taxonomy of the existing and proposed statistical database management systems. We then survey the query languages of these systems. It is clear from this survey that there has been a flurry of research activity in SDB's during recent years. However, the research in SDB data models and query languages is far from over. For example, there are several commonly used SDB objects (such as matrices, time series, and historical data) whose manipulations by the current systems are ad hoc and not well-understood. Implementations and evaluations of some of the proposed systems are not yet done. Presently there are no SDB query languages that provide all the capabilities listed in Section II in an integrated manner. New semantic data models capturing the special utilization characteristics of SDB's and the associated query languages remain to be investigated.

ACKNOWLEDGMENT

The authors would like to thank D. Batory, S. Ghosh, and S. Weiss for their comments on an earlier version of this paper.

REFERENCES

[1] G. A. Anderson, T. Snider, B. Robinson, and J.
Toporek, "An integrated research support system for inter-package communication and handling large volume output from statistical database analysis operations," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[2] D. S. Batory, "On searching transposed files," ACM Trans. Database Syst., vol. 4, no. 1, 1979.
[3] W. A. Brown, S. B. Navathe, and S. Y. W. Su, "Complex data types and a data manipulation language for scientific and statistical databases," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[4] R. Buhler, "Data manipulation in P-STAT," in Proc. 1st Int. Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[5] R. G. G. Cattell, "An entity-based database user interface," in Proc. ACM SIGMOD Conf., 1980.
[6] Computer Corporation of America, File Manager's Technical Reference Manual, Comput. Corp. Amer., Cambridge, MA, Model 204 Database Management Syst., 1979.
[7] P. Chan and A. Shoshani, "SUBJECT: A directory driven system for organizing and accessing large statistical databases," in Proc. VLDB Conf., 1980.
[8] P. P. S. Chen, "The entity relationship model: Toward a unifying view of data," ACM Trans. Database Syst., vol. 1, no. 1, 1976.
[9] E. F. Codd, "Relational completeness of database sublanguages," in Database Systems (Courant Computer Science Symposia Series, Vol. 6). Englewood Cliffs, NJ: Prentice-Hall, 1972.
[10] C. J. Date, An Introduction to Database Systems, 3rd ed. Reading, MA: Addison-Wesley, 1981.
[11] D. E. Denning, Cryptography and Data Security. Reading, MA: Addison-Wesley, 1982.
[12] S. M. Dintelman and A. T. Maness, "An implementation of a query language supporting path expressions," in Proc. ACM SIGMOD Conf., 1982.
[13] D. E. Denning, W. Nicholson, G. Sande, and A. Shoshani, "Research topics in statistical database management," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[14] J. B.
Fry, "Data manipulation in SPSS and SPSS-X," in Proc. 1st LBL Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[15] S. P. Ghosh, "Statistical relational tables for statistical database management," IBM Res. Lab., San Jose, CA, Tech. Rep. RJ 4394, 1984.
[16] -, "An application of statistical databases in manufacturing testing," in Proc. IEEE COMPDEC Conf., 1984.
[17] R. Hammond, "Metadata in the RAPID DBMS," in Proc. 1st LBL Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[18] S. Heiler and R. F. Bergman, "SIBYL: An economist's workbench," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[19] L. A. Hollabaugh and L. T. Reinwald, "GPI: A statistical package/database interface," in Proc. 1st LBL Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[20] G. G. Hendrix et al., "Developing a natural language interface to complex data," ACM Trans. Database Syst., vol. 3, no. 2, 1978.
[21] H. Ikeda and Y. Kobayashi, "Additional facilities of a conventional DBMS to support interactive statistical analysis," in Proc. 1st LBL Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[22] "SQL/data system: General information," IBM Corp., Rep. GH245012, 1981.
[23] G. Jaeschke and H.-J. Schek, "Remarks on the algebra of non first normal form relations," in Proc. 1st ACM SIGACT/SIGMOD PODS Conf., 1982.
[24] M. Jarke and J. Koch, "Query optimization in database systems," ACM Comput. Surveys, vol. 16, no. 2, 1984.
[25] R. Johnson, "Modelling summary data," in Proc. ACM SIGMOD Conf., 1981.
[26] I. Karasolo and P. Svensson, "An overview of CANTOR-A new system for data analysis," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[27] J. C. Klensin, "A statistical database component of a data analysis and modelling system: Lessons from eight years of user experience," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept.
1983.
[28] A. Klug, "ABE-A query language for constructing aggregates-by-example," in Proc. 1st LBL Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[29] -, "Equivalence of relational algebra and relational calculus query languages having aggregate functions," J. ACM, vol. 29, no. 3, 1982.
[30] -, "Access paths in the ABE statistical query facility," in Proc. ACM SIGMOD Conf., 1982.
[31] S. Kohji and H. Sato, "Statistical database research project in Japan and the CAS SDB project," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[32] M. Maier and C. Cirilli, "SYSTEM/K: A knowledge base management system," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[33] J. L. McCarthy, "Metadata management for large statistical databases," in Proc. VLDB Conf., 1982.
[34] A. T. Maness and S. A. Dintelman, "Design of the genealogical information system," in Proc. 1st Int. Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[35] -, "The GENISYS data definition facilities," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[36] D. Merrill, J. McCarthy, F. Gey, and H. Holmes, "Distributed data management in a minicomputer network," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[37] F. Olken, "How baroque should a statistical database management system be?" in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[38] A. Rosenthal and D. Reiner, "Extending the algebraic framework of query processing to handle outerjoins," in Proc. VLDB Conf., 1984.
[39] G. Ozsoyoglu, Z. M. Ozsoyoglu, and F. Mata, "A language and a physical organization technique for summary tables," in Proc. ACM SIGMOD Conf., 1985.
[40] G. Ozsoyoglu, Z. M. Ozsoyoglu, and V.
Matos, "Extending relational algebra and relational calculus with set-valued attributes and aggregate functions," submitted for publication, 1985.
[41] G. Ozsoyoglu and V. Matos, "On optimizing summary-table-by-example queries," in Proc. 4th ACM SIGACT/SIGMOD PODS Conf., 1985.
[42] G. Ozsoyoglu and Z. M. Ozsoyoglu, "Features of SSDB," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[43] -, "An extension of relational algebra for summary tables," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[44] -, "STBE-A database query language for manipulating summary data," in Proc. IEEE COMPDEC Conf., 1984.
[45] -, "A query language for statistical databases," in Query Processing in Database Systems, W. Kim, D. Reiner, and D. S. Batory, Eds. New York: Springer-Verlag, 1985.
[46] -, "SSDB-An architecture for statistical databases," in Proc. 4th IJCIT Conf., 1984.
[47] H. Sato, "Handling summary information in a database: derivability," in Proc. ACM SIGMOD Conf., 1981.
[48] A. Shoshani, "CABLE: A language based on the E-R model," in Proc. E-R Conf., 1979.
[49] -, "Statistical databases: Characteristics, problems and some solutions," in Proc. VLDB Conf., 1982.
[50] S. Y. W. Su, S. B. Navathe, and D. S. Batory, "Logical and physical modeling of statistical/scientific databases," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[51] M. Stonebraker, E. Wong, P. Kreps, and G. Held, "The design and implementation of INGRES," ACM Trans. Database Syst., vol. 1, no. 3, 1976.
[52] M. Stonebraker, R. Johnson, and S. Rosenberg, "A rules system for a relational database management system," in Proc. Conf. Improving Database Usability and Responsiveness, 1982.
[53] S. Y. W. Su, "SAM*: A semantic association model for corporate and scientific-statistical databases," Inform. Sci., vol. 29, 1983.
[54] J. J. Thomas and D. L.
Hall, "ALDS project: Motivation, statistical database management issues, perspectives, and directions," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[55] P. Tsichritzis, "LSL: A link and selector language," in Proc. ACM SIGMOD Conf., 1976.
[56] M. Turner, R. Hammond, and P. Cotten, "A DBMS for large statistical databases," in Proc. VLDB Conf., 1979.
[57] J. W. Tukey, Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
[58] S. E. Weiss, Private Communication.
[59] S. E. Weiss, P. L. Weeks, and N. J. Byrd, "Must we navigate through databases?" in Proc. 1st Int. Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[60] S. E. Weiss and P. L. Weeks, "PASTE-A tool to put application systems together easily," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[61] H. K. T. Wong and I. Kuo, "GUIDE: Graphical user interface for database exploration," in Proc. VLDB Conf., 1982.
[62] J. D. Ullman, Principles of Database Systems, 2nd ed. Rockville, MD: Computer Science, 1982.
[63] Table Producing Language System, version 5, Bureau of Labor Statistics, Washington, DC, July 1980.
[64] M. M. Zloof, "Query-by-example; a database language," IBM Syst. J., 1977.

Gultekin Ozsoyoglu (S'79-M'80) received the B.Sc. degree in electrical engineering and the M.Sc. degree in computer science from the Middle East Technical University, Ankara, Turkey, in 1972 and 1974, respectively, and the Ph.D. degree in computer science from the University of Alberta, Edmonton, Alta., Canada, in 1980. He is presently an Assistant Professor of Computer Engineering and Science, Case Institute of Technology, Case Western Reserve University, Cleveland, OH. From 1980 to 1983 he was with the Department of Computer Information and Science, Cleveland State University, Cleveland. His research interests include statistical databases, data models, and expert systems.

Zehra Meral Ozsoyoglu received the B.Sc. degree in electrical engineering and the M.Sc. degree in computer science from the Middle East Technical University, Ankara, Turkey, in 1973 and 1975, respectively, and the Ph.D. degree in computer science from the University of Alberta, Edmonton, Alta., Canada, in 1980. She has been an Assistant Professor of Computer Engineering and Science at Case Institute of Technology, Case Western Reserve University, Cleveland, OH, since 1980. Her research interests include query processing in distributed databases, query optimization, database theory, and statistical databases. Dr. Ozsoyoglu was a recipient of an IBM Faculty Development Award, 1983.

Antisampling for Estimation: An Overview

NEIL C. ROWE

Abstract-We survey a new way to get quick estimates of the values of simple statistics (like count, mean, standard deviation, maximum, median, and mode frequency) on a large data set. This approach is a comprehensive attempt (apparently the first) to estimate statistics without any sampling. Our "antisampling" techniques have analogies to those of sampling, and exhibit similar estimation accuracy, but can be done much faster than sampling with large computer databases. Antisampling exploits computer science ideas from database theory and expert systems, building an auxiliary structure called a "database abstract." We make detailed comparisons to several different kinds of sampling.

Index Terms-Estimation, expert systems, inequalities, parametric optimization, query processing, sampling, statistical computing, statistical databases.

Manuscript received February 15, 1985; revised June 1, 1985. This work was supported in part by the Foundation Research Program of the Naval Postgraduate School with funds provided by the Chief of Naval Research and in part by the Knowledge Base Management Systems Project at Stanford University under Contract N00039-82-G-0250 from the Defense Advanced Research Projects Agency of the United States Department of Defense. The author is with the Department of Computer Science, Code 52, Naval Postgraduate School, Monterey, CA 93943. 0098-5589/85/1000-1081$01.00 (c) 1985 IEEE

I. INTRODUCTION

WE are developing a new approach to estimation of statistics. This technique, called "antisampling," is fundamentally different from known techniques in that it does not involve sampling in any form. Rather, it is a sort of inverse of sampling.

Fig. 1. General outline of sampling and antisampling.

Consider some finite data population P that we wish to study (see Fig. 1). Suppose that P is large, and it is too much work to calculate many statistics on it, even with a