IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-11, NO. 10, OCTOBER 1985, p. 1071
Statistical Database Query Languages
GULTEKIN OZSOYOGLU, MEMBER, IEEE, AND ZEHRA MERAL OZSOYOGLU
Abstract-Databases that are mainly used for statistical analysis are
called statistical databases (SDB). A statistical database management
system (SDBMS) may be defined as a database management system that
provides capabilities 1) to model, store, and manipulate data in a manner suitable for the needs of SDB users, and 2) to apply statistical data
analysis techniques that range from simple summary statistics to advanced procedures. This paper surveys the existing and proposed SDB
data definition and data manipulation (i.e., query) languages.
Index Terms-Database systems, data definition, data manipulation,
query languages, statistical databases.
I. INTRODUCTION
DATABASES that are mainly used for statistical analysis
are called statistical databases (SDB). SDB application areas
include health care, census data
evaluation, economic planning, and management decision
making, among others. A statistical database management
system (SDBMS) may be defined as a database management system that provides capabilities 1) to model, store,
and manipulate data in a manner suitable for the needs of
SDB users, and 2) to apply statistical data analysis techniques that range from simple summary statistics to advanced procedures like discriminant or factor analysis. For
simple summary statistics, the SDBMS is expected to have
powerful, easy-to-use, and efficient data aggregation features. On the other hand, for more advanced statistical
data analysis, the SDBMS provides an interface to statistical
analysis procedures, which is either transparent to users
or produces explicit output data ready to be input to statistical analysis procedures.
Most of the database management systems currently
available in the commercial market are designed for business data processing environments. The primary goals of
these so-called "corporate" database management systems (CDBMS) are to improve the productivity of application programmers and to facilitate easy data access by
naive users [24]. However, CDBMS's are not widely used
in SDB application areas, primarily because the conceptual
and internal modeling tools and the query languages that they
provide do not meet SDB users' needs. For example, data
aggregation features of CDBMS's are add-on, ad hoc, and
usually inefficient.
Traditionally the data management needs of SDB users
have been met by restricted data management capabilities
of statistical packages and file management systems plus
customized application programs. Another approach has
been either to extend or modify the capabilities of an existing CDBMS to accommodate an SDB application or to
build a new SDBMS.
What distinguishes an SDBMS from a statistical package? As more and more data management capabilities are
introduced into statistical packages (such as a B+-tree organization in P-STAT [4] or new data manipulation commands of SPSS-X [14]), and as the software of statistical
packages becomes more and more integrated, it is important to have some criteria to distinguish SDBMS's from
statistical packages. We think that having the emphasis on
proper data management tools for SDB users such as the
availability of conceptual modeling tools, query languages
and rich internal (physical) modeling constructs, rather
than an emphasis on advanced statistical analysis procedures, is a good criterion to distinguish an SDBMS from a
statistical package. Using this criterion, in this paper, we distinguish existing and proposed SDBMS's in the
literature, and examine their SDB query (i.e., data definition and data manipulation) languages.
In Section II, we list the criteria used to evaluate SDB
query languages. Section III gives a taxonomy of proposed and existing SDBMS's, and Sections IV and V examine SDB query languages according to the taxonomy
given in Section III. Section VI discusses languages designed to manipulate summary tables, objects commonly
used by SDB users. Section VII contains the concluding
remarks.
Manuscript received February 15, 1985; revised June 5, 1985. This work
was supported by the National Science Foundation under Grant MCS-8306616.
Z. M. Ozsoyoglu was supported by an IBM Faculty Development Award.
The authors are with the Department of Computer Engineering and Science,
Case Institute of Technology, Case Western Reserve University,
Cleveland, OH 44106.
0098-5589/85/1000-1071$01.00 © 1985 IEEE
II. EVALUATION CRITERIA FOR SDB QUERY LANGUAGES
Data modeling and manipulation capabilities of
SDBMS's are developed according to the operational use
of data by users. For example, during the exploratory data
analysis phase [57] users deal with representative, interpreted, cleaned or experimental subsets of data. The special utilization characteristics of SDB's necessitate incorporation of extensive metadata capabilities and new objects
such as summary tables, matrices, and scatter diagrams
[33], [37], [42], [46], [49], [50]. Therefore in our survey
of SDB query languages, we evaluate (to the extent possible) specific data and metadata definition capabilities
such as
* the objects definable by the language,
* data descriptors (units of measure, scale, missing
values, data quality information, universe description),
* footnotes,
* keywords,
* textual description and historical data,
* editing specifications and data structuring capabilities,
and specific data manipulation capabilities such as
* aggregation capabilities,
* subsetting and sampling,
* metadata manipulation,
* handling time explicitly,
* historical data.
In all CDBMS's, aggregate functions such as SUM,
MAX and AVE are incorporated in an ad hoc manner into
the associated query language. The relational algebra and
the relational calculus query languages introduced by
Codd [9] do not formally incorporate aggregate functions.
Since aggregation operations are extremely frequent in statistical databases, query languages of most SDBMS's (such
as STRAND [25], GENISYS [34], SSDB [43] and others) provide powerful and/or user-friendly aggregation capabilities.
Summary tables, tabular representations of summary
(aggregated) data, are so important that almost all statistical packages provide some form of limited summary table output formatting capabilities. There are also some
SDBMS's (such as HSDB [21] and STBE [44]) that provide summary table manipulation languages. Section VI
surveys summary table manipulation languages.
Execution of an advanced statistical analysis procedure
usually requires a long set of parameters to be initialized
by a set of syntactically complex commands. There are
three ways of interaction between an SDB query language
and a statistical analysis procedure. One approach is to
embed a specific statistical procedures library into an
SDBMS and develop syntactically simpler and easy-to-use
capabilities in the query language to execute statistical
analysis procedures (alternative one). This may be viewed
as a master-slave approach in which the SDBMS dictates
the execution of statistical analysis procedures, and the interface between the SDBMS and the library procedures is
transparent to the user. This approach has been criticized
as being inflexible since it does not permit users to access
different statistical analysis procedures in different packages. In another approach, users specify the execution of
a procedure from a specific package in their query, the
SDBMS prepares the input to that procedure, and the user
later initiates the package procedure execution (alternative two). This requires the SDBMS to be capable of producing the commands of a specific statistical package.
In the third approach the SDBMS produces a flat table of
data needed for the execution of a statistical package, and
the users are responsible for creating the package commands and for initiating the package execution (alternative three). We will comment on the type of statistical
package interaction in SDB query languages.
It is important to define the expressive (manipulative)
power of a language with respect to an object since such
a definition unambiguously defines what type of manipulations are and are not achievable by the language.
Therefore, whenever possible, we will specify the expressive power of an SDB language. We will also comment on
the ease of use, syntax and functionality of an SDB query
language.
III. THE TAXONOMY OF SDBMS's
In this paper we examine SDB query languages of the
following systems.
1) SDBMS's Built on Top of CDBMS's: The majority of the
CDBMS's in this category are relational systems. Examples are HSDB (on Model 204 [6]) [21], Ghosh's extensions to SQL [15], System/K (on SQL/DS [22]) [32], and
STRAND (on Ingres [51]) [25].
Another approach is to use a Generalized Interface
System that links together available CDBMS's, statistical
packages and graphics software using a single high level
language. Examples are PASTE [60], SIBYL [18], and
GPI [19].
2) Separately Developed SDBMS's: Below we further
categorize these systems by the data model and query language they use.
a) Relational Data Model and Relational Query
Languages: These systems use new internal (file) organization techniques and/or additional conceptual modeling tools and/or well-defined aggregation functions. Examples include RAPID [56] and CAS SDB [31], which use
relational algebra; ABE [28], which uses relational calculus; SIR/SQL [1], GENISYS [34], and CANTOR [26],
which use SQL [10]; and JANUS [27], which uses relation-like objects and relational algebra-like operators.
b) Network Data Model: An example is SIR/DBMS
[1].
c) Formally Extended Relational Model and Relational Algebra/Calculus Languages: Examples are SSDL
[3], Klug's work [29], [30], and extensions of Ozsoyoglu
et al. [40], [43].
d) Graphical User Interfaces: Examples are SUBJECT [7], GUIDE [61], ABE [28], STBE [44], SEEDIS
On-line Codebook [36], and ALDS Data Editor [54].
IV. QUERY LANGUAGES OF SDBMS's BUILT ON TOP
OF CDBMS's
The query language STRAND [25] is developed as an
ER-model [8] query language. STRAND expressions are
translated into QUEL statements, the query language of
the relational system INGRES. STRAND is a derivative
of CABLE [48], and it lacks data definition language
(DDL) features. The main advantage of STRAND is to
allow aggregate query formulations in an easy manner
when the query involves a chain of entity sets (i.e., relations) in the ER-model. In such a case it is sufficient to
specify the entity sets by marking the beginning and the
end of a chain of, say, n entity sets. Then the system performs the n-way join using the relationships between the
entity sets in the chain (called the chaining operation).
The only other operations are projection and restriction
on entity sets (which are identical to the projection and
restriction of the relational algebra) and summarization
(aggregation) on iteratively aggregated entity sets (called
summary sets). STRAND is not relationally complete
(i.e., its expressive power of manipulating relations is less
than that of relational algebra as defined by Codd [9]), since
there are no set union, set difference, and set intersection
operators. It can also be used only with tree-structured
ER-models, since the existence of more than one path between two entity sets creates query processing ambiguity.
Also there are no time and metadata handling capabilities in STRAND. Because relations are produced by
STRAND, alternative three can be used as interface to
statistical packages.
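The chaining operation can be illustrated with a minimal Python sketch. It is not STRAND syntax: the entity sets DEPT, EMP, and PAY, their attributes, and the helper names natural_join and chain are hypothetical, and shared attribute names stand in for the ER relationships that STRAND would use to perform the join.

```python
from collections import defaultdict

def natural_join(left, right):
    """Join two lists of dict-tuples on every attribute name they share."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    index = defaultdict(list)
    for r in right:
        index[tuple(r[a] for a in shared)].append(r)
    return [{**l, **r} for l in left for r in index[tuple(l[a] for a in shared)]]

def chain(entity_sets, order, start, end):
    """n-way join of all entity sets lying between `start` and `end` in the chain."""
    i, j = order.index(start), order.index(end)
    result = entity_sets[order[i]]
    for name in order[i + 1 : j + 1]:
        result = natural_join(result, entity_sets[name])
    return result

# Hypothetical tree-structured schema: DEPT - EMP - PAY.
entity_sets = {
    "DEPT": [{"dept": "D1", "region": "EAST"}, {"dept": "D2", "region": "WEST"}],
    "EMP": [{"emp": "E1", "dept": "D1"}, {"emp": "E2", "dept": "D1"},
            {"emp": "E3", "dept": "D2"}],
    "PAY": [{"emp": "E1", "salary": 10}, {"emp": "E2", "salary": 20},
            {"emp": "E3", "salary": 5}],
}
order = ["DEPT", "EMP", "PAY"]

# The user marks only the two ends of the chain.
joined = chain(entity_sets, order, "DEPT", "PAY")

# Summarization: total salary per region over the chained result.
totals = defaultdict(int)
for row in joined:
    totals[row["region"]] += row["salary"]
print(dict(totals))
```

Only the endpoints DEPT and PAY are named in the call; the intermediate entity set EMP is picked up from the chain, which is the convenience the chaining operation is meant to provide.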
HSDB [21] is an SDBMS implemented on top of the
relational system Model 204. HSDB has extensive data
descriptors such as the discrimination between discrete/
continuous values (or the original source), missing values,
the unit, the precision, the theoretical distribution, and
summary statistics about the set (or bag) of values in a
column of a relation. In addition, HSDB retains metadata
information about derived data as to when it is derived,
who has derived it and the formula used for derivation.
Also, single-column relations can be created by specifying
the tuple component values in various ways. In addition to
the relations, HSDB maintains summary tables and provides a limited set of summary operations (see Section VI).
For statistical analysis, alternative one is chosen, i.e., the
query language contains a set of statistical analysis procedures as operations on relations or summary tables. For
security, access control commands with limited power are
provided for each relation and summary table.
Ghosh [15] extends Codd's relational model with a new
object, called the statistical relational table (SRT) (which
is identical to a primitive summary table; see Section VI),
and proposes a set of extensions to the SQL language to
create an SRT from relations, to select a smaller SRT
from a given SRT, to further aggregate the information in
an SRT, to implement statistical sampling techniques such
as stratified sampling or systematic sampling, and to implement statistical data analysis procedures such as time
series analysis or curve fitting (i.e., alternative one to statistical data analysis is proposed).
System/K [32] is an "object-oriented" knowledge base
management system and is built on top of SQL/DS. Although System/K lacks the majority of SDBMS characteristics, it has extensive metadata management capabilities
and an interface to user-specific languages (i.e., a User
Specialty Language Interface).
SIBYL [18], PASTE [60], and GPI [19] are examples of
systems that use a CDBMS, statistical packages and
graphics software available off-the-shelf to create a software system for managing data. In addition to the query
language of the underlying CDBMS, these systems provide data restructuring capabilities and commands to
browse, update and extract data. SIBYL is a system that
manages time series data. It uses the relational system
Model 204 as the CDBMS, a database template (for mapping the logical structures of a time series database into
the Model 204 structures), a set of procedures for browsing,
updating and extracting data, and statistical packages. GPI has "customizer" software for tailoring a statistical package to access a specific data/file structure, and
a "dictionary" for describing data/file structures. The
general approach in PASTE is to let the users 1) write
their application programs using the commands of statistical packages and/or the query languages of a CDBMS,
and 2) for each data transformation between different systems (where a system may be a statistical package, a
CDBMS or a graphics package), produce PASTE commands to handle the transformation.
V. QUERY LANGUAGES OF SEPARATELY
DEVELOPED SDBMS's
A. Relational Model-Based SDB Query Languages
Systems in this category use relations as data modeling
tools and an algebra or calculus-based language.
RAPID [17], [56] is a relational system developed by
Statistics Canada and widely used by statistical agencies
in several countries. It uses relational algebra to process
user queries. The main characteristics of RAPID are its
very efficient execution of statistical queries (using transposed files [2]) and efficient storage utilization (using data
compression by encoding). Each RAPID relation is a self-describing transposed file. It is self-describing in the sense
that data and metadata about a relation (such as attribute
names, data types, size, domain, last update date, status,
etc.) are stored together in the same file. Additional metadata information is maintained in the RAPID dictionary
(which is a single relation) in terms of entities and items
of entities, and accessed using a special retrieval operator.
An entity may be a relation, a codeset (which describes the codes
used for encoding), a value set (a special codeset for qualitative relation columns), or a comment. Items describe
information about entities (e.g., relation items describe
columns, and codeset items define the codes). RAPID is
relationally complete and has an interface (alternative
three) to statistical packages (e.g., SAS, SPSS) and to
summary table producing systems (e.g., TPL).
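The two storage ideas mentioned above, transposed files and compression by encoding, can be sketched in a few lines of Python. This is an illustration under our own assumptions, not RAPID's file format: the function names transpose and encode and the sample relation are hypothetical.

```python
def transpose(rows, attributes):
    """Store one value list per attribute instead of one record per tuple."""
    return {a: [row[a] for row in rows] for a in attributes}

def encode(column):
    """Replace each value by a small integer code; return (codeset, codes)."""
    codeset = sorted(set(column))
    code_of = {value: code for code, value in enumerate(codeset)}
    return codeset, [code_of[value] for value in column]

# Hypothetical sample relation.
rows = [
    {"ID": 1, "PROVINCE": "ONTARIO", "INCOME": 30},
    {"ID": 2, "PROVINCE": "QUEBEC", "INCOME": 20},
    {"ID": 3, "PROVINCE": "ONTARIO", "INCOME": 40},
]

columns = transpose(rows, ["ID", "PROVINCE", "INCOME"])

# A qualitative column is stored as codes plus a codeset that decodes them.
codeset, codes = encode(columns["PROVINCE"])

# A summary statistic reads a single column and ignores the others.
mean_income = sum(columns["INCOME"]) / len(columns["INCOME"])
```

A summary statistic over INCOME touches only that one column, which is one reason a transposed organization suits statistical queries.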
GENISYS [12], [34], [35] is an SDBMS that provides
a relation-like view of the data in the database where a
relation (an entity) corresponds to a file. Since some relation columns are allowed to contain repeating fields or
ranges of values, relations may be considered to be in non-first normal form [62]. The query language GQL of GENISYS is a high-level SQL-like language. For computing
functions, GQL has facilities for users to specify macro-like program fragments and to expand the fragments referenced in a query automatically into high-level program
code.
GQL uses predefined (by the DBA) links in a novel way
to specify and execute joins among relations efficiently.
These defined links are similar to those in the Link and
Selector language [55]. The use of links for joins however
means that two relations cannot be joined if there is no
path of links among them. Or if the only path between two
relations involves n relations, it requires an (n - 1)-way
join to join the two relations.
GQL allows users to specify aggregation of a population of values by first grouping (not partitioning) the values using range specifiers, and then applying aggregation
on each group (similar to the aggregation-by-template operator of SSDB). For example, consider the relation
scheme PERSON (ID, SEX, BIRTHYEAR, BIRTHCOUNTY, DEATHYEAR). The GQL query
    SELECT AVERAGE (DEATHYEAR-BIRTHYEAR)
    BY SEX
    BY BIRTHYEAR (1900 ... 1954,5)
    BY BIRTHCOUNTY ('UTAH', 'SALT LAKE', '*')
groups persons from the "main logical file" PERSON by
SEX, BIRTHYEAR ranges of size 5 from 1900 to 1954,
and BIRTHCOUNTY values of UTAH, SALT LAKE and
* ("*" denotes "all counties"), and computes the average
age at death for each group. Each BY clause is a range
specifier.
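The difference between grouping and partitioning in this query can be made concrete with a Python sketch. It is our own reconstruction of the semantics described above, not GQL's implementation: the helper names, the sample persons, and the treatment of range boundaries are assumptions.

```python
from collections import defaultdict
from itertools import product

def by_distinct():
    """Plain BY attribute: every distinct value labels its own group."""
    return lambda value: [value]

def by_ranges(low, high, width):
    """BY attribute (low ... high, width): consecutive ranges of the given width."""
    def labels(value):
        if not low <= value <= high:
            return []
        start = low + (value - low) // width * width
        return [(start, min(start + width - 1, high))]
    return labels

def by_values(*listed):
    """BY attribute (v1, v2, '*'): listed values, with '*' matching every value."""
    return lambda value: [v for v in listed if v == "*" or v == value]

def average_by(rows, expr, specifiers):
    """Average expr(row) per group; a row joins every group whose labels it matches."""
    groups = defaultdict(list)
    for row in rows:
        label_lists = [spec(row[attr]) for attr, spec in specifiers]
        for key in product(*label_lists):
            groups[key].append(expr(row))
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Hypothetical PERSON tuples.
persons = [
    {"ID": 1, "SEX": "F", "BIRTHYEAR": 1901, "BIRTHCOUNTY": "UTAH", "DEATHYEAR": 1971},
    {"ID": 2, "SEX": "F", "BIRTHYEAR": 1903, "BIRTHCOUNTY": "DAVIS", "DEATHYEAR": 1983},
    {"ID": 3, "SEX": "M", "BIRTHYEAR": 1906, "BIRTHCOUNTY": "SALT LAKE", "DEATHYEAR": 1966},
]

result = average_by(
    persons,
    lambda p: p["DEATHYEAR"] - p["BIRTHYEAR"],
    [("SEX", by_distinct()),
     ("BIRTHYEAR", by_ranges(1900, 1954, 5)),
     ("BIRTHCOUNTY", by_values("UTAH", "SALT LAKE", "*"))],
)
```

Person 1 contributes both to the UTAH group and to the "*" group, so the groups overlap; a partitioning operator would place each tuple in exactly one group.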
GQL has operations to create synonyms and comments,
to display information on the coding schemes used (GENISYS
uses encoding for data compression) and on the links
in the database. There are operations to append header
information to relations and attach attribute names to columns of output relations. Undefined values are represented as null values. GQL has consistency constraint
checking mechanisms (as a rules system [52]) and abstract data types for attributes which consist of other attributes. There are also facilities for users to browse
through the data dictionary.
ABE [28] is a screen-oriented language similar to
Query-by-Example (QBE) [64]. The main feature of ABE
is to use subqueries with parameters to express aggregations instead of the grouping operations (as in SQL). In
SQL, grouping operations automatically eliminate empty
partitions from the output; therefore, after applying aggregation to the set of partitions, the result does not contain any information about empty partitions (whereas in
ABE empty partitions are retained). This is called the
empty partition problem. Moreover, some nested aggregations expressible by a single ABE query cannot be expressed in SQL [28]. Therefore, in addition to being syntactically simpler and user-friendly, as far as aggregation
queries are concerned, ABE is more powerful than both
SQL and QUEL.
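The empty partition problem is easy to reproduce with Python's sqlite3 module. The sketch below is an analogy, not ABE: a correlated subquery in present-day SQL plays the role of ABE's subquery with parameters, and the dept and emp tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (name TEXT PRIMARY KEY);
    CREATE TABLE emp (id INTEGER PRIMARY KEY, dept TEXT);
    INSERT INTO dept VALUES ('SALES'), ('RESEARCH');
    INSERT INTO emp VALUES (1, 'SALES'), (2, 'SALES');
""")

# Grouping: RESEARCH has no employees, forms no partition, and vanishes.
grouped = conn.execute(
    "SELECT d.name, COUNT(*) FROM dept d JOIN emp e ON e.dept = d.name "
    "GROUP BY d.name ORDER BY d.name"
).fetchall()

# Parameterized subquery: the aggregate is evaluated once per department,
# so the empty partition is retained with a count of zero.
per_dept = conn.execute(
    "SELECT d.name, (SELECT COUNT(*) FROM emp e WHERE e.dept = d.name) "
    "FROM dept d ORDER BY d.name"
).fetchall()

print(grouped)
print(per_dept)
```

The grouped result lists only SALES, while the subquery formulation also reports RESEARCH with zero employees.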
ABE can express conjunctive relational queries (it is related to the relational calculus with aggregate functions
[29]). However, it cannot express set union, and therefore
is not relationally complete. Moreover, there are some
other simple queries that are not expressible by ABE due
to the limitations of the available query formulation constructs [45].
In ABE, queries that involve all, only, or no qualifiers
(i.e., existential and universal quantifications in predicate
calculus), are handled by set comparison operators (i.e.,
set equality and set containment). This approach is simpler and more user-friendly than the semi-explicit use of
quantifiers in QBE.
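How set comparison can stand in for explicit quantifiers is sketched below with Python sets. The example is ours and does not reproduce ABE's screen syntax; the supplier and part names are hypothetical.

```python
# Hypothetical data: the set of parts supplied by each supplier.
supplies = {
    "S1": {"BOLT", "NUT", "SCREW"},
    "S2": {"BOLT"},
    "S3": set(),
}
red_parts = {"BOLT", "NUT"}

# "all": the supplier's set contains every red part (set containment).
all_red = [s for s, parts in supplies.items() if red_parts <= parts]

# "only": every supplied part is red, and at least one part is supplied.
only_red = [s for s, parts in supplies.items() if parts and parts <= red_parts]

# "no": the red parts among those supplied form the empty set (set equality).
no_red = [s for s, parts in supplies.items() if (parts & red_parts) == set()]
```

Each qualifier becomes one comparison between two sets, with no quantified variable appearing in the query.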
The system CANTOR [26], designed for the analysis of
large sets of data, uses an object-based data model and a
data manipulation language, SAL, based on an algebra of
relations. Objects are either elementary objects (e.g., integers, literals, or text) or tuples or set objects. A relation
is a special set object, namely a set of tuples of the same
type. Metadata maintained includes information about
stored data, and consists of three relations. Each object
may have the mode value (i.e., a stored object) or view
(i.e., an expression that, when evaluated, forms a value).
SAL queries are nonrecursive algebraic expressions in
which operators are functions from operand values to result values. If the result value cannot be computed then it
is "undefined" of appropriate type. The unary and binary
operators include arithmetic operators (e.g., +), arithmetic comparison operators (e.g., <), arithmetic functions (e.g., SQR), and logical operators (e.g., AND) that
are valid for the proper object type. Binary set algebra
operators include set equality, set containment, set union,
etc. All basic relational algebra operators such as restriction, generalized projection, selection (with variations like
SELECTMIN, SELECTMAX), and Cartesian product are
available. A set of aggregation operators such as COMPUTE, SUM and PRODUCT are also provided. There is
a partitioning operator that partitions a relation into a set
of relations and applies aggregation to each relation (similar to the SELECT-GROUP BY feature of SQL). Subsystems
for statistical analysis remain to be designed and implemented.
Another experimental system is CAS SDB [31] which
has extensive metadata management facilities, an interface to the statistical package SAS, and uses a subset of
relational algebra as its query language.
JANUS is an SDBMS used within a large-scale data
analysis and modeling system called CONSISTENT [27].
JANUS utilizes relations with set-valued attributes and
null values, and a set of relational algebra-like operators.
It has operations that are approximately equivalent to set
theory operations and directional join (or outer join) [38].
There are also capabilities to attach information to a relation (e.g., the mean or standard deviation of the values
in a column of the relation). JANUS uses the concept of
links (called mappings), similar to the links of GENISYS,
to specify relationships between different relations. It then
uses these links to infer information among relations.
JANUS has an interface to the rest of the CONSISTENT
system for statistical analysis (using alternative two or alternative three).
The SIR/DBMS system [1], currently being developed,
provides a relational view of data on which one can superimpose any hierarchical or network views. The SIR
software provides a relational query system, SIR/SQL+,
which allows the user to deal only with relations. From the
examples in [1], SIR/SQL+ appears to have the same expressive
power for manipulating relations as SQL. In addition,
SIR/DBMS has facilities for 1) naming, labeling, and documenting the data in the database, 2) data quality control
(e.g., range and consistency checking, and special handling of missing and undefined data), 3) I/O security controls,
4) a set of simple statistical procedures that include
frequency distributions and histograms, descriptive statistics, scattergrams, line printer plots and simple linear
regressions, 5) summary data tabulation features (see Section VI), and 6) an interface to statistical packages BMDP,
SPSS and SAS by creating the input data file to these packages (i.e., alternative three).
B. Network or Hierarchical Model-Based SDB
Query Languages
Compared to the relational model, there seem to be
very few SDBMS's that use network or hierarchical data
models. One of these systems proposed is SIR/DBMS
which has a procedural retrieval language that enables the
users to navigate the network (or hierarchical) database.
The details of the navigational query language however
are not clear [1].
The Table Producing Language (TPL) Descriptive
Codebook system (TPLDC) [59] uses a rather unconventional way of utilizing the relationships between entity sets
(where each entity set is a file). First, the database administrator forms a directed graph G of relationships between
entity sets. A one-to-many relationship between entity sets
A and B is represented by a directed edge from A to B. A
one-to-one relationship is also represented by only one directed edge (which is an implementation decision [58]). A
many-to-many relationship between entity sets A and B is
represented by two directed edges, one from A to B, another from B to A. All possible rooted directed trees where
a node does not have more than one "one-to-many" edge
to its children (Rule R) in G are enumerated into a set S,
and all trees which are subtrees of a tree in S are deleted
from S to form the set V. The trees in the set V are specified by association statements, and the set V becomes a
permanent feature of the TPLDC system. When a user
wants to produce a summary table (see Section VI) from
the database, he chooses a tree (called view in TPL) from
the set V using the use command; selects a subset of the
entities for summarization in the summary table using logical conditions that involve arithmetic operators, comparison operators, and logical operators; and then specifies
the attributes (called variables) of entities to be extracted
and aggregated for the summary table. The restriction introduced due to the rule R is not needed for unambiguous
query specification; rather, it is an implementation restriction. There are two problems with this approach. First,
the set V may be extremely large. Consider a directed clique G (i.e., a directed graph where for any two nodes A
and B, there is an edge from A to B) with n nodes, in
which each edge represents a "one-to-many" relationship. There are n! trees in V. Even when V is
small there may be several trees for users to choose from.
Secondly, if a user's query accesses only a few entity sets,
he still has to deal with possibly large trees. However, most
databases that use the TPLDC system typically have a very
small number of entity sets (e.g., n < 5) [58], and TPL
is used in over 200 computer centers around the world
[59].
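The n! count for a directed clique can be checked by brute force for small n. The Python sketch below follows our reading of rule R: when every edge is one-to-many, a node may have at most one child, so each admissible rooted tree is a simple directed path. The function name maximal_chains is hypothetical, and "subtree" is taken to mean a contiguous subpath.

```python
from itertools import permutations
from math import factorial

def maximal_chains(nodes):
    """Build S (all rule-R trees of a directed clique) and return V."""
    # In a clique every ordering of distinct nodes is a valid directed path,
    # so S is the set of all k-permutations for k = 1..n.
    S = [p for k in range(1, len(nodes) + 1) for p in permutations(nodes, k)]

    def is_proper_subpath(p, q):
        """True if path p occurs as a contiguous, strictly shorter part of q."""
        return len(p) < len(q) and any(
            q[i : i + len(p)] == p for i in range(len(q) - len(p) + 1)
        )

    # V keeps only the trees that are not subtrees of another tree in S.
    return [p for p in S if not any(is_proper_subpath(p, q) for q in S)]

for n in (2, 3, 4):
    V = maximal_chains(list("ABCD"[:n]))
    print(n, len(V), factorial(n))
```

For each n the size of V equals n!, since the surviving trees are exactly the paths that visit all n entity sets.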
C. SDB Query Languages that Utilize Formal
Extensions to the Relational Model
Systems in this category include SSDL [3], Klug's work
on relational algebra and calculus, and algebra and calculus extensions of Ozsoyoglu et al. [40], [43].
SSDL is a high-level procedural data manipulation language that manipulates objects of type set, ordered set,
vector, matrix, time, time series, text and G-relation (referred to as complex data types). All of the complex data
types except G-relation are self-explanatory. G-relation
(i.e., generalized relation) is an object type that is used
to represent a data model called the Semantic Association
Data Model (SAM*) [53]. The SAM* models the real
world in terms of a set of interrelated associations: membership, aggregation, generalization, interaction, composition, cross-product, and summarization. These associations are represented by one or more G-relations.
A G-relation is a relation (i.e., a set of tuples) with each
column (attribute) of the relation drawing its values from
a complex domain. A complex domain may be of any
complex data type, including the G-relation itself. Therefore a G-relation is not in first normal form [62] since
tuple components do not always have elementary-valued
(atomic) data types such as integers or reals. G-relation is
also different from the non-first normal form relations [23]
(which allow only elementary values and arbitrarily nested
sets of elementary values as tuple components) in the sense
that it allows objects of various data types (e.g., matrix or
ordered set) as tuple components.
Since G-relation is recursively defined, an arbitrary
number of G-relations may be nested inside a single tuple
component of a G-relation. This allows for the construction of arbitrarily complex G-relations. However, internal
(file) organization techniques for G-relations are yet to be
investigated.
Attributes of a G-relation may be distinguished as category (i.e., identifying) or summary attributes. Category
attributes of a tuple qualify the summary attributes of the
tuple which contain the measurements and needed values.
Operators of SSDL include the usual relational algebra
operators for G-relations, set theory operators for sets,
linear algebra operators for vectors and matrices, and string
manipulation operators for text as well as some related set
theory and linear algebra operators for ordered set and
time series. Since SSDL is designed to be highly procedural, it contains explicit constructs to scan tuples of a
relation and to perform manipulations and conditional
evaluations (similar to the for construct of Pascal), the
notion of a currency pointer to retain scanning positions
for nested scans, and blocking constructs such as BEGIN-END and DO-END. As a result, an SSDL query resembles
high-level programming language code. This approach
deviates significantly from the notion of providing database users with a minimal number of operators for the sake
of simplicity while maintaining the expressive power of the
language at a certain level. There is also some overlap of
capabilities (for first normal form relations) provided by
the relational algebra operators, and the scanning and
blocking constructs.
Klug [29] extends relational algebra and relational calculus by incorporating aggregate functions, and shows that
the extended languages have the same expressive power.
Klug's extension of the relational algebra (as defined by
Codd [9]) is a new aggregation operator, called the aggregate formation, that partitions a relation (or an algebra expression evaluating to a relation) on a set of attributes X, applies an aggregate function, say SUM, to each
partition, and outputs the X-value and the associated SUM-value for each partition. As an example, consider the relation
    HOUSES-SOLD (HOUSE#, STATE, COUNTY, HOUSE-PRICE).
The aggregate formation operation
    HOUSES-SOLD <STATE, SUM (HOUSE-PRICE)>
returns a two-column relation with each tuple containing
a state name and the sum of the prices of the houses sold
in that state.
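The aggregate formation example above can be mirrored in a short Python sketch. The function name aggregate_formation, the representation of tuples as dictionaries, and the sample tuples are our assumptions; Klug's operator is defined algebraically, not procedurally.

```python
from collections import defaultdict

def aggregate_formation(relation, X, f, attr):
    """Partition `relation` on attributes X and apply f to `attr` in each partition."""
    partitions = defaultdict(list)      # a list (bag), so duplicate values are kept
    for t in relation:
        partitions[tuple(t[a] for a in X)].append(t[attr])
    # One output tuple per partition: the X-value followed by the aggregate value.
    return {x_value + (f(values),) for x_value, values in partitions.items()}

# Hypothetical HOUSES-SOLD tuples; two houses share the same price.
houses_sold = [
    {"HOUSE#": 1, "STATE": "OH", "COUNTY": "CUYAHOGA", "HOUSE-PRICE": 50},
    {"HOUSE#": 2, "STATE": "OH", "COUNTY": "LAKE", "HOUSE-PRICE": 50},
    {"HOUSE#": 3, "STATE": "UT", "COUNTY": "SALT LAKE", "HOUSE-PRICE": 40},
]

# HOUSES-SOLD <STATE, SUM (HOUSE-PRICE)>
result = aggregate_formation(houses_sold, ["STATE"], sum, "HOUSE-PRICE")
print(sorted(result))
```

The prices are collected per partition as a bag: projecting HOUSES-SOLD onto (STATE, HOUSE-PRICE) as a set before summing would merge the two equal Ohio prices and give the wrong total.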
Relational calculus originally introduced by Codd uses
an alpha expression that consists of a target list and a
formula. Klug [29] (in order to dynamically define ranges
for variables) extends the relational calculus by replacing
the formula with range formulas and a qualifier, where
a range formula itself is allowed to be a (closed) alpha
expression.
Ozsoyoglu et al. [40], [43] define extended relational
algebra and calculus languages (mainly for manipulating
summary tables-see Section VI) that utilize aggregate
functions and relations with set-valued attributes (common in statistical databases), and show that the algebra
and calculus languages so extended are equivalent in expressive power [40]. The extended algebra uses an aggregation-by-template operator, and pack and unpack
operations for set-valued attributes. The aggregation-by-template operator groups tuples of a relation into (not necessarily disjoint) groups using a template relation, applies
an aggregate function to each group, and returns the template value of the group and the associated aggregate
value. The aggregation-by-template is more convenient
than the aggregate formation when there are prespecified
groupings of attributes for aggregation (common in statistical databases). Also the aggregation-by-template is based
on grouping tuples (i.e., a tuple may belong to more than
one group) while the aggregate formation is based on partitioning tuples. However, each aggregation operator is expressible by an algebra expression utilizing the other aggregation operator.
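The difference between grouping and partitioning can be illustrated with a small Python sketch. It is a simplified reading of the operator, not the definition given in [40], [43]: the relation `DEATHS`, the template of overlapping age sets, and the membership rule (a tuple joins a group when its AGE value lies in the group's set) are all assumptions made for illustration.

```python
# Hypothetical relation DEATHS(SEX, AGE): one tuple per recorded death.
DEATHS = [("F", 25), ("F", 35), ("M", 35), ("M", 70)]

# Hypothetical template: each entry identifies a group by a set of AGE values.
# The two sets overlap on ages 30..39, so the groups are not disjoint.
TEMPLATE = [frozenset(range(0, 40)), frozenset(range(30, 100))]

def aggregate_by_template(relation, attr_idx, template, agg=len):
    """For each template value, collect the tuples whose attribute at
    `attr_idx` falls in that value, apply `agg` to the group, and return
    (template value, aggregate value) pairs in template order."""
    out = []
    for group in template:
        members = [t for t in relation if t[attr_idx] in group]
        out.append((group, agg(members)))
    return out

counts = aggregate_by_template(DEATHS, attr_idx=1, template=TEMPLATE)
group_sizes = [c for _, c in counts]
```

Because the two tuples with AGE 35 belong to both groups, the group counts add up to more than the number of tuples in the relation, which cannot happen when the relation is partitioned.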
The extended relational calculus of [40] forms the basis
of a user-friendly language, called Summary-Table-by-Example (STBE) [45], which is the query language of an SDBMS
called SSDB [46]. STBE manipulates summary tables and
relations with set-valued attributes (set-valued relations),
and is similar to QBE and ABE query languages. In STBE,
the user constructs an example query on the screen by filling
in skeletons (i.e., graphical schemes) of relations and
summary tables in hierarchically arranged windows (i.e.,
subqueries). The hierarchical structure of subqueries in
STBE (also in ABE) provides a natural way to specify aggregate functions, and also solves the empty partition
problem. For query processing, STBE queries are converted into an extended relational algebra expression (of
[43]), and transformed into semantically equivalent
expressions which are more efficient to execute by conventional techniques [41].
D. SDB User Interfaces
The difficulties encountered by noncomputer science
professionals in using database query languages led database researchers to a variety of user-friendly user interfaces. The reasons for these difficulties are [61]: 1) the
requirement on the part of the user to remember too many
details such as the meanings of acronyms used for entity
and relation types, and their attributes, 2) inadequate semantics of data models that are usually based on abstract
mathematical concepts (e.g., symbolic logic theory or set
theory), 3) lack of a facility to formulate queries in a
piecemeal fashion (especially important in statistical databases due to the exploratory nature of the data analysis),
4) lack of levels of detail in database schemes, and 5) lack
of a metadata browsing facility. The user interfaces proposed or implemented by database researchers and practitioners range from powerful graphical query languages
with a well-understood expressive power (such as QBE) to
natural language-based query languages (such as [20]) and
menu-driven, browsing-based systems (such as E-R Interface [5]). For SDB user interfaces, we are not aware of
any natural language-based query languages. However,
there are graphical SDB query languages with a formal
expressive power such as STBE and ABE, and menu-driven, browsing-based user interfaces with extensive facilities, such as GUIDE, SUBJECT, SEEDIS On-line
Codebook, and ALDS Data Editor. Since STBE and ABE
have been discussed above and are revisited in Section VI, we summarize below the menu-driven SDB user interfaces.
GUIDE [61] uses the E-R Model to represent entities
and relationships explicitly on the screen as a network of
objects. Parts of the schema can be removed from the
screen by the system automatically (i.e., multiple levels of
details) or by the user. To aid the user in exploring the metadata, there are two kinds of directories. The hierarchical
subject directory organizes the entity types in the database into logical groups hierarchically. The user is guided
by the system through this directory to locate the part of
the schema he wishes to see on the screen. The hierarchical attribute directory organizes attribute types into
hierarchical groups. Both directories are implemented as
menus. There is a facility to order and classify entity and
relationship types into groups according to their relevancy
to a particular group of users. There are also commands
to move the displayed schema around the screen, to zoom in and zoom out on selected parts, etc.
GUIDE queries are expressed as a traversal along the
network of entities on the screen. A GUIDE query is a
path selected by the user. Users can then formulate local
queries in different colors, see their results, and then link
those local queries into more complex local queries.
The SUBJECT system [7] has two basic types of abstractions to represent the data and metadata of SDB's using a
directed acyclic graph (i.e., a hierarchy), called the SUBJECT graph. The cluster abstraction of the SUBJECT
graph represents the set membership relationship according to a common property, or the clustering of entities
according to a common property. For example, entities
"male" and "female" are clustered into a set "sex."
Similarly, "white," "black," "hispanic," etc., are clustered into a
set "race." The cross product abstraction utilizes category attributes and summary attributes. Entities in statistical databases are commonly partitioned into groups
using descriptive (category) attribute values, and a quantitative (summary) attribute is aggregated to obtain a single summary value (for a new summary attribute). The
cross product abstraction represents the cross product of
an n-dimensional space where each dimension corresponds to a category attribute, and each combination of
category attribute values corresponds to a single summary
value.
The SUBJECT system provides an interactive facility for
specifying the SUBJECT graph, a browsing facility, and
a document command to attach textual information to
each node in the graph. SUBJECT queries are specified
during browsing using menu techniques and a small set of
commands. The user moves around (browses) the SUBJECT graph, and includes nodes into the set of query conditions. The conditions are then anded (i.e., only conjunctive queries are allowed) and the output is displayed.
Different semantics are attached to cross product and cluster nodes so that, if they are selected for the query, automatic aggregation consistent with natural language
expressions involving summary data is performed by the
SUBJECT system.
SEEDIS [36] is a distributed system for the retrieval,
analysis and display of geographically linked data. For
identifying the data to be retrieved (i.e., data selection),
a SEEDIS user defines a geographic scope and level, and
selects the desired data items from an on-line data dictionary using a browsing facility. Then the extract command
retrieves the selected data items.
The ALDS data editor [54] of the ALDS system is an experimental data editor and a subset generator. It has a set
of commands to specify subsets of data, and uses a graphical representation of data analysis environments (using a
directed acyclic graph representation on the screen) with
various features such as defining views (called virtual
subsets) and attaching conditions or environmental parameters to a data manipulation operation. ALDS has an
interface to the statistical package MINITAB (alternative
three).
VI. SUMMARY TABLE MANIPULATION LANGUAGES
One of the basic functions of SDBMS's is to create and
manipulate summary data from the raw or summary data
in the database. Tabular representations of summary data
(summary tables) are widely used in various SDB application areas. Summary tables are not used only for output formatting; they are maintained (mostly manually at
present) for bookkeeping, compared, and evaluated, usually over a time span. Therefore, it is proper to consider
summary tables as logical SDB data modeling tools, and
provide query languages for defining and manipulating
summary tables. SDBMS's with varying ranges of summary table creation and/or manipulation capabilities include STRAND, SUBJECT, HSDB, TPL, STL [39], and
STBE.
Fig. 1(a) shows an instance of the summary table 1985-DEATH-COUNTS-IN-CUYAHOGA-COUNTY. A summary table consists of a two-dimensional table of summary
(cell) attribute values, and category attribute values, in
rows and columns of the table, that qualify the cell attribute values. Category attribute values are structured as row
and column forests of category attribute trees whose
nodes are attribute values. Cell attributes are always simple- (elementary-) valued; category attributes may be simple-valued or set-valued. Fig. 1(b) shows the summary table scheme for the summary table in Fig. 1(a). Attributes
COUNT1 and COUNT2 are cell attributes; attribute
*DEATH-AGE is a set-valued category attribute (indicated by the prefix '*'); and attributes SEX, DEATH-CAUSE and RACE are simple-valued category attributes.
A primitive summary table is a summary table with
exactly one cell attribute and the associated category attributes. A summary table in general is represented as a
set of primitive summary tables. Fig. 1(c) contains the
primitive summary table instances 1985-DEATH-COUNTS-BY-SEX-DEATHCAUSE and 1985-DEATH-COUNTS-BY-SEX-RACE of the summary table scheme
1985-DEATH-COUNTS-IN-CUYAHOGA-COUNTY in
Fig. 1(b). Sato [47] gives the theoretical foundations for
derivability of primitive summary tables from other primitive summary tables and/or atomic data.
A relation possibly with set-valued attributes can be
used to represent a primitive summary table excluding the
order and the type (i.e., row or column) of category attributes. Fig. 1(d) contains the relation instances
DEATHCOUNTS1 and DEATHCOUNTS2 that represent
primitive summary tables 1985-DEATH-COUNTS-BY-SEX-DEATHCAUSE and 1985-DEATH-COUNTS-BY-SEX-RACE, respectively, of Fig. 1(c). Notice that both
relations are non-first normal form relations due to the set-valued attribute *DEATH-AGE. A relation instance that
represents a primitive summary table (and that has no null
cell attribute values, where null stands for nonexistent) is
said to be information equivalent [43] to that primitive
summary table.
Fig. 1. (a) Summary table instance. (b) Summary table scheme. (c) Instances of primitive summary tables. (d) Relations DEATHCOUNTS1
and DEATHCOUNTS2.
The STRAND query language has the capability to create a relation that represents several primitive summary
tables having category attributes at once, using a single
STRAND operation, called summarization. However, in
providing such a powerful operator, STRAND creates an
inflexibility and a user inconvenience in that the database
administrator must a priori define the procedures (in the
schema) for obtaining values of each cell attribute in each
primitive summary table. STRAND does not have any
other explicit summary table operations.
A cross product abstraction instance in the SUBJECT
system may be viewed as a primitive summary table. The
aggregation command of the SUBJECT system allows
users to obtain new primitive summary tables from other
summary tables (when the aggregation is applied to a cross
product node in the SUBJECT graph) and from atomic
data (when the aggregation is applied to a cluster node in
the SUBJECT graph). SUBJECT does not have any other
summary table operations.
The HSDB system has capabilities to create and manipulate
primitive summary tables (called elementary summary
tables). It has an operation that creates a primitive summary table from a relation. The operations that take a primitive summary table and
produce a primitive summary table are projection and reclassification. Projection eliminates one category attribute from a
primitive summary table by proper aggregation of cell attribute values. Reclassification merges values of a category
attribute into larger disjoint groups and, for each group,
computes a new cell attribute value by the proper aggregation operator. Both of these operations utilize "hidden
information" (e.g., they do not explicitly specify the aggregation function, which is stored in the schema and used
to obtain the new cell attribute values) and have restrictions in their usage (e.g., projection does not work if the
aggregation function used to obtain the original primitive
summary table is MEDIAN). The only other summary table operation in HSDB is the concatenation operator,
which allows users to concatenate primitive summary tables to obtain nonprimitive summary tables whose category attribute trees are simple chains.
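Projection and reclassification can be sketched as follows. This is a minimal illustration, not HSDB's interface: the dictionary encoding of a primitive summary table, the function names `project_out` and `reclassify`, and all cell values are hypothetical. The last lines show why MEDIAN is a problem: the median of the combined raw data cannot in general be recomputed from the medians stored in the cells.

```python
from collections import defaultdict
from statistics import median

# Hypothetical primitive summary table encoded as {(SEX, RACE): COUNT}.
table = {
    ("F", "White"): 10, ("F", "Black"): 20, ("F", "Other"): 5,
    ("M", "White"): 30, ("M", "Black"): 40, ("M", "Other"): 15,
}

def project_out(table, drop_idx, combine=sum):
    """Eliminate the category attribute at `drop_idx` by combining the cell
    values that share the remaining category attribute values."""
    groups = defaultdict(list)
    for key, cell in table.items():
        groups[key[:drop_idx] + key[drop_idx + 1:]].append(cell)
    return {k: combine(v) for k, v in groups.items()}

def reclassify(table, idx, mapping, combine=sum):
    """Merge values of the category attribute at `idx` into the larger
    disjoint groups named by `mapping` and combine the affected cells."""
    groups = defaultdict(list)
    for key, cell in table.items():
        groups[key[:idx] + (mapping[key[idx]],) + key[idx + 1:]].append(cell)
    return {k: combine(v) for k, v in groups.items()}

by_sex = project_out(table, drop_idx=1)  # {("F",): 35, ("M",): 85}
coarse = reclassify(table, 1, {"White": "White", "Black": "Nonwhite", "Other": "Nonwhite"})

# Counts combine by SUM, but medians do not combine from cell values alone.
a, b = [1, 2, 9], [3, 4]                 # hypothetical raw data behind two cells
cell_medians = [median(a), median(b)]
true_median = median(a + b)
```

In this sketch the aggregation function is passed explicitly as `combine`; in HSDB, as noted above, it is kept in the schema rather than stated in the operation.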
Ghosh [15] proposes two languages that manipulate a
single primitive summary table (called statistical relational table) in which set-valued category attribute values
are identified by a single value (e.g., the *DEATH-AGE
value {10, ..., 20} is represented by 15). The first language is a relational algebra-based language with commands project to eliminate rows or columns from a primitive summary table and aggregate to remove rows or
columns from a primitive summary table by aggregation
(identical to the Attribute-Removal-by-Aggregation operation of STL [39]). There are also data sampling (and statistical analysis) commands that use a given primitive
summary table cell instance as a raw data (i.e., microdata) population specification from which sampling is
done. The second language, Query by Statistical Relational Table (QBSRT), is a two-dimensional graphical language similar to QBE, and has the same manipulative
power as the first algebraic language. However, QBSRT,
unlike QBE, produces a primitive summary table instance
(rather than a scheme) on the terminal screen to specify a
query, and therefore may provide too many details to users
(see Section V-D), reducing its user friendliness.
The TPL [63] system contains nine different commands to
produce arbitrarily complex summary tables from tree-structured files (see Section V-B). Two of these commands, use and select, are already discussed in Section
V-B. The table command specifies the row and column
forests of category attribute trees 1) by utilizing the ordering among the category attribute trees of the forest,
and 2) by essentially specifying the preorder enumeration
of each category attribute tree (i.e., first the root then the
subtrees from left to right) using a rather complex syntax.
Users can describe a new category or cell attribute from
existing attributes (and their values) using define or compute commands. A new cell attribute and its associated
category attributes can be defined from the computed cell
attributes using the post-compute command. For time series data, the relative time command relieves the users of
the burden of continuously changing the values of the category attribute DATE. Finally, median and quantile commands allow median and quantile (cell attribute) values to
be produced.
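The preorder enumeration that the table command relies on can be shown with a short Python sketch. The `(label, children)` tuple encoding and the example tree are assumptions for illustration only; this is not TPL syntax.

```python
def preorder(tree):
    """Yield node labels in preorder: first the root, then each subtree
    from left to right."""
    label, children = tree
    yield label
    for child in children:
        yield from preorder(child)

# Hypothetical category attribute tree: SEX at the root with the subtrees
# DEATH-CAUSE and RACE, written as (label, [children]).
tree = ("SEX", [("DEATH-CAUSE", []), ("RACE", [])])
order = list(preorder(tree))
```

A forest of category attribute trees would then be described by enumerating its trees in their given order.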
TPL has very powerful summary table creation facilities from data file(s). However, it is executed as a standalone system in batch mode, and lacks commands that operate on previously produced summary tables (i.e., it does
not manipulate summary tables).
STBE has the power to create and manipulate arbitrarily complex summary tables. For summary table creation
in a query, the user specifies the summary table scheme
in the output section of the query, using a parenthesized
expression. The corresponding summary table skeleton is
graphically displayed by STBE. The user then proceeds to
specify the example query which may extract information
from summary tables and/or relations. Whenever the
STBE query has a reference to a primitive summary table
of a certain summary table, that primitive summary table
is extracted and converted into the corresponding (set-valued) relation. Therefore, after this conversion, as far as
the query processing is concerned, an STBE query may be
regarded as manipulating only relations, and producing a
set of relations (if the output is a summary table) or a single relation. This approach of converting the references to
a summary table into references to a set of relations leads
to an integrated query language with a well-understood
expressive power, and an efficient query processing technique [41].
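The conversion between a primitive summary table and its set-valued relation can be sketched as below. The dictionary layout, the function names `to_relation` and `to_summary_table`, and the cell values are hypothetical; the sketch only illustrates the idea that the relation keeps the category and cell values but not the row/column placement, which therefore has to be supplied again when a table is rebuilt.

```python
# Hypothetical primitive summary table: row categories (SEX, DEATH-CAUSE),
# column category *DEATH-AGE (set-valued), and a single cell attribute COUNT.
YOUNG = frozenset(range(0, 40))
OLD = frozenset(range(40, 100))

summary_table = {
    "rows": [("F", "Cancer"), ("M", "Heart Failure")],
    "cols": [YOUNG, OLD],
    "cells": [[3, 7], [5, 9]],  # cells[i][j] is qualified by rows[i] and cols[j]
}

def to_relation(st):
    """Return tuples (SEX, DEATH-CAUSE, *DEATH-AGE, COUNT). The result is a
    set, so the order and the row/column type of the categories are dropped."""
    return {
        row + (col, st["cells"][i][j])
        for i, row in enumerate(st["rows"])
        for j, col in enumerate(st["cols"])
    }

def to_summary_table(relation, rows, cols):
    """Rebuild the table from the relation, given the desired row and column
    orderings (the information the relation itself does not carry)."""
    lookup = {(t[:-2], t[-2]): t[-1] for t in relation}
    return {"rows": rows, "cols": cols,
            "cells": [[lookup[(r, c)] for c in cols] for r in rows]}

relation = to_relation(summary_table)
rebuilt = to_summary_table(relation, summary_table["rows"], summary_table["cols"])
```

The relation is in non-first normal form because each *DEATH-AGE value is a set of ages rather than a single age.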
The Summary Table Language (STL) [39] contains a set
of summary table manipulation operators, that, together
with the relational algebra of set-valued relations (as defined in [43]), form an algebraic language for creating and
manipulating set-valued relations and arbitrary summary
tables. STL, the algebraic counterpart of STBE, has six
basic operations. Relation Formation (REL) and Primitive Summary Table Formation (ST) operations provide
the conversions of a primitive summary table to/from the
corresponding relation. Concatenate (CONC) operation
concatenates two summary tables that have the same row
or column forests of category attribute trees. Extract (EX)
operation, the inverse of concatenate, extracts a summary
table whose row and column forests each contains a single
category attribute tree that belongs to the original input
summary table. Attribute Split (SPLIT) and Attribute
Merge (MERGE) operations provide primitive/nonprimitive
summary table transformation capabilities by relocating
the rows/columns of a summary table. Similarly,
relation formation and primitive summary table formation
provide relation/primitive summary table transformation
capabilities. Therefore a nonprimitive summary table can
be transformed into a set of (perhaps set-valued) relations
and manipulated using the extended relational algebra operators [43]. Fig. 2 describes the objects of STL and the
associated operations.
Fig. 2. Summary table language (STL) basic operations.
Although the basic operations of STL are powerful enough to
manipulate arbitrary summary tables, expressions for
some common summary table manipulation queries become quite long. Therefore STL has additional operations
(expressible by basic operations) that simplify common
expressions significantly. These operations include Aggregation-over-Table, Attribute-Removal-by-Aggregation, and operations for summary table formation from
several primitive summary tables and decomposing a summary table into its primitive summary tables.
VII. CONCLUDING REMARKS
In this paper we give a taxonomy of the existing and
proposed statistical database management systems. We
then survey the query languages of these systems.
It is clear from this survey that there has been a flurry
of research activity in SDB's during recent years. However, the research in SDB data models and query languages is far from over. For example, there are several
commonly used SDB objects (such as matrices, time series, and historical data) whose manipulations by the current systems are ad hoc and not well-understood. Implementations and evaluations of some of the proposed
systems are not yet done. Presently there are no SDB
query languages that provide all the capabilities listed in
Section II in an integrated manner. New semantic data
models capturing the special utilization characteristics of
SDB's and the associated query languages remain to be
investigated.
ACKNOWLEDGMENT
The authors would like to thank D. Batory, S. Ghosh,
and S. Weiss for their comments on an earlier version of
this paper.
REFERENCES
[1] G. A. Anderson, T. Snider, B. Robinson, and J. Toporek, "An integrated research support system for inter-package communication and
handling large volume output from statistical database analysis operations," in Proc. 2nd Int. Workshop Statistical Database Management,
Los Altos, CA, Sept. 1983.
[2] D. S. Batory, "On searching transposed files," ACM Trans. on Database Syst., vol. 4, no. 1, 1979.
[3] W. A. Brown, S. B. Navathe, and S. Y. W. Su, "Complex data types
and a data manipulation language for scientific and statistical databases," in Proc. 2nd Int. Workshop Statistical Database Management,
Los Altos, CA, Sept. 1983.
[4] R. Buhler, "Data manipulation in P-STAT," in Proc. 1st Int. Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[5] R. G. G. Cattell, "An entity-based database user interface," in Proc.
ACM SIGMOD Conf, 1980.
[6] Computer Corporation of America, File Manager's Technical Reference Manual, Comput. Corp. Amer., Cambridge, MA, Model 204
Database Management Syst., 1979.
[7] P. Chan, and A. Shoshani, "SUBJECT: A directory driven system for
organizing and accessing large statistical databases," in Proc. VLDB
Conf, 1980.
[8] P. P. S. Chen, "The entity relationship model: Toward a unifying view
of data," ACM Trans. Database Syst., vol. 1, no. 1, 1976.
[9] E. F. Codd, "Relational completeness of database sublanguages," in
Database Systems (Courant Computer Science Symposia Series, Vol.
6). Englewood Cliffs, NJ: Prentice-Hall, 1972.
[10] C. J. Date, An Introduction to Database Systems, 3rd ed. Reading,
MA: Addison-Wesley, 1981.
[11] D. E. Denning, Cryptography and Data Security. Reading, MA: Addison-Wesley, 1982.
[12] S. M. Dintelman and A. T. Maness, "An implementation of a query
language supporting path expressions," in Proc. ACM SIGMOD Conf,
1982.
[13] D. E. Denning, W. Nicholson, G. Sande, and A. Shoshani, "Research
topics in statistical database management," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[14] J. B. Fry, "Data manipulation in SPSS and SPSS-X," in Proc. 1st
LBL Workshop Statistical Database Management, Menlo Park, CA,
Dec. 1981.
[15] S. P. Ghosh, "Statistical relational tables for statistical database management," IBM Res. Lab., San Jose, CA, Tech. Rep. RJ 4394, 1984.
[16] -, "An application of statistical databases in manufacturing testing," in Proc. IEEE COMPDEC Conf., 1984.
[17] R. Hammond, "Metadata in the RAPID DBMS," in Proc. 1st LBL
Workshop Statistical Database Management, Menlo Park, CA, Dec.
1981.
[18] S. Heiler and R. F. Bergman, "SIBYL: An economist's workbench,"
in Proc. 2nd Int. Workshop Statistical Database Management, Los
Altos, CA, Sept. 1983.
[19] L. A. Hollabaugh and L. T. Reinwald, "GPI: A statistical package/
database interface," in Proc. 1st LBL Workshop Statistical Database
Management, Menlo Park, CA, Dec. 1981.
[20] G. G. Hendrix et al., "Developing a natural language interface to
complex data," ACM Trans. Database Syst., vol. 3, no. 2, 1978.
[21] H. Ikeda and Y. Kobayashi, "Additional facilities of a conventional
DBMS to support interactive statistical analysis," in Proc. 1st LBL
Workshop Statistical Database Management, Menlo Park, CA, Dec.
1981.
[22] "SQL/data system: General information," IBM Corp., Rep. GH24-5012, 1981.
[23] G. Jaeschke and H.-J. Schek, "Remarks on the algebra of non first normal
form relations," in Proc. 1st ACM SIGACT/SIGMOD PODS Conf,
1982.
[24] M. Jarke and J. Koch, "Query optimization in database systems,"
ACM Comput. Surveys, vol. 16, no. 2, 1984.
[25] R. Johnson, "Modelling summary data," in Proc. ACM SIGMOD
Conf, 1981.
[26] I. Karasolo and P. Svensson, "An overview of CANTOR-A new system for data analysis," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[27] J. C. Klensin, "A statistical database component of a data analysis
and modelling system: Lessons from eight years of user experience,"
in Proc. 2nd Int. Workshop Statistical Database Management, Los
Altos, CA, Sept. 1983.
[28] A. Klug, "ABE-A query language for constructing aggregates-by-
example," in Proc. 1st LBL Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[29] -, "Equivalence of relational algebra and relational calculus query
languages having aggregate functions," J. ACM, vol. 29, no. 3, 1982.
[30] -, "Access paths in the ABE statistical query facility," in Proc.
ACM SIGMOD Conf., 1982.
[31] S. Kohji and H. Sato, "Statistical database research project in Japan
and the CAS SDB project," in Proc. 2nd Int. Workshop Statistical
Database Management, Los Altos, CA, Sept. 1983.
[32] M. Maier and C. Cirilli, "SYSTEM/K: A knowledge base management system," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[33] J. L. McCarthy, "Metadata management for large statistical databases," in Proc. VLDB Conf., 1982.
[34] A. T. Maness and S. A. Dintelman, "Design of the genealogical information system," in Proc. 1st Int. Workshop Statistical Database
Management, Menlo Park, CA, Dec. 1981.
[35] -, "The GENISYS data definition facilities," in Proc. 2nd Int.
Workshop Statistical Database Management, Los Altos, CA, Sept.
1983.
[36] D. Merrill, J. McCarthy, F. Gey, and H. Holmes, "Distributed data
management in a minicomputer network," in Proc. 2nd Int. Workshop
Statistical Database Management, Los Altos, CA, Sept. 1983.
[37] F. Olken, "How baroque should a statistical database management
system be?" in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[38] A. Rosenthal and D. Reiner, "Extending the algebraic framework of
query processing to handle outerjoins," in Proc. VLDB Conf, 1984.
[39] G. Ozsoyoglu, Z. M. Ozsoyoglu, and F. Mata, "A language and a
physical organization technique for summary tables," in Proc. ACM
SIGMOD Conf., 1985.
[40] G. Ozsoyoglu, Z. M. Ozsoyoglu, and V. Matos, "Extending relational
algebra and relational calculus with set-valued attributes and aggregate functions," submitted for publication, 1985.
[41] G. Ozsoyoglu and V. Matos, "On optimizing summary-table-by-example queries," in Proc. 4th ACM SIGACT/SIGMOD PODS Conf,
1985.
[42] G. Ozsoyoglu and Z. M. Ozsoyoglu, "Features of SSDB," in Proc.
2nd Int. Workshop Statistical Database Management, Los Altos, CA,
Sept. 1983.
[43] -, "An extension of relational algebra for summary tables," in Proc.
2nd Int. Workshop Statistical Database Management, Los Altos, CA,
Sept. 1983.
[44] -, "STBE-A database query language for manipulating summary
data," in Proc. IEEE COMPDEC Conf, 1984.
[45] -, "A query language for statistical databases," in Query Processing in Database Systems, W. Kim, D. Reiner, and D. S. Batory,
Eds. New York: Springer-Verlag, 1985.
[46] -, "SSDB-An architecture for statistical databases," in Proc. 4th
IJCIT Conf., 1984.
[47] H. Sato, "Handling summary information in a database: derivability," in Proc. ACM SIGMOD Conf., 1981.
[48] A. Shoshani, "CABLE: A language based on the E-R model," in
Proc. E-R Conf., 1979.
[49] -, "Statistical databases: Characteristics, problems and some solutions," in Proc. VLDB Conf, 1982.
[50] S. Y. W. Su, S. B. Navathe, and D. S. Batory, "Logical and physical
modeling of statistical/scientific databases," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[51] M. Stonebraker, E. Wong, P. Kreps, and G. Held, "The design and
implementation of INGRES," ACM Trans. Database Syst., vol. 1, no.
3, 1976.
[52] M. Stonebraker, R. Johnson, and S. Rosenberg, "A rules system for
a relational database management system," in Proc. Conf Improving
Database Usability and Responsiveness, 1982.
[53] S. Y. W. Su, "SAM*: A semantic association model for corporate and
scientific-statistical databases," Inform. Sci., vol. 29, 1983.
[54] J. J. Thomas and D. L. Hall, "ALDS project: Motivation, statistical
database management issues, perspectives, and directions," in Proc.
2nd Int. Workshop Statistical Database Management, Los Altos, CA,
Sept. 1983.
[55] D. Tsichritzis, "LSL: A link and selector language," in Proc. ACM
SIGMOD Conf., 1976.
[56] M. Turner, R. Hammond, and P. Cotten, "A DBMS for large statistical databases," in Proc. VLDB Conf., 1979.
[57] J. W. Tukey, Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
[58] S. E. Weiss, Private Communication.
[59] S. E. Weiss, P. L. Weeks, and N. J. Byrd, "Must we navigate through
databases?" in Proc. 1st Int. Workshop Statistical Database Management, Menlo Park, CA, Dec. 1981.
[60] S. E. Weiss and P. L. Weeks, "PASTE-A tool to put application
systems together easily," in Proc. 2nd Int. Workshop Statistical Database Management, Los Altos, CA, Sept. 1983.
[61] H. K. T. Wong and I. Kuo, "GUIDE: Graphical user interface for
database exploration," in Proc. VLDB Conf, 1982.
[62] J. D. Ullman, Principles of Database Systems, 2nd ed. Rockville,
MD: Computer Science, 1982.
[63] Table Producing Language System, version 5, Bureau of Labor Statistics, Washington, DC, July 1980.
[64] M. M. Zloof, "Query-by-example: a database language," IBM Syst. J.,
1977.
Gultekin Ozsoyoglu (S'79-M'80) received the
B.Sc. degree in electrical engineering and the
M.Sc. degree in computer science from the Middle East Technical University, Ankara, Turkey, in
1972 and 1974, respectively, and the Ph.D. degree
in computer science from the University of Alberta, Edmonton, Alta., Canada, in 1980.
He is presently an Assistant Professor of Computer Engineering and Science, Case Institute of
Technology, Case Western Reserve University,
Cleveland, OH. From 1980 to 1983 he was with
the Department of Computer Information and Science, Cleveland State University, Cleveland. His research interests include statistical databases, data
models, and expert systems.
Zehra Meral Ozsoyoglu received the B.Sc. degree in electrical engineering and the M.Sc. degree
in computer science from the Middle East
Technical University, Ankara, Turkey, in 1973 and
1975, respectively, and the Ph.D. degree in computer science from the University of Alberta, Edmonton, Alta., Canada, in 1980.
She has been an Assistant Professor of Computer Engineering and Science at Case Institute of
Technology, Case Western Reserve University,
Cleveland, OH, since 1980. Her research interests
include query processing in distributed databases, query optimization, database theory, and statistical databases.
Dr. Ozsoyoglu was a recipient of an IBM Faculty Development Award,
1983.
Antisampling for Estimation: An Overview
NEIL C. ROWE
Abstract-We survey a new way to get quick estimates of the values
of simple statistics (like count, mean, standard deviation, maximum,
median, and mode frequency) on a large data set. This approach is a
comprehensive attempt (apparently the first) to estimate statistics without any sampling. Our "antisampling" techniques have analogies to
those of sampling, and exhibit similar estimation accuracy, but can be
done much faster than sampling with large computer databases. Antisampling exploits computer science ideas from database theory and expert systems, building an auxiliary structure called a "database abstract." We make detailed comparisons to several different kinds of
sampling.
Index Terms-Estimation, expert systems, inequalities, parametric
optimization, query processing, sampling, statistical computing, statistical databases.
Manuscript received February 15, 1985; revised June 1, 1985. This work
was supported in part by the Foundation Research Program of the Naval
Postgraduate School with funds provided by the Chief of Naval Research
and in part by the Knowledge Base Management Systems Project at Stanford University under Contract N00039-82-G-0250 from the Defense Advanced Research Projects Agency of the United States Department of Defense.
The author is with the Department of Computer Science, Code 52, Naval
Postgraduate School, Monterey, CA 93943.
0098-5589/85/1000-1081$01.00 © 1985 IEEE
I. INTRODUCTION
We are developing a new approach to estimation of
statistics. This technique, called "antisampling," is
fundamentally different from known techniques in that it
does not involve sampling in any form. Rather, it is a sort
of inverse of sampling.
Fig. 1. General outline of sampling and antisampling.
Consider some finite data population P that we wish to
study (see Fig. 1). Suppose that P is large, and it is too
much work to calculate many statistics on it, even with a