Set Predicates in SQL: Enabling Set-Level Comparisons for
Dynamically Formed Groups
Abstract
In data warehousing and OLAP applications, scalar-level predicates in SQL become increasingly inadequate to support a class of operations that require set-level comparison semantics, i.e., comparing a group of tuples with multiple values. Currently, complex SQL queries composed of scalar-level operations are often formed to obtain even very simple set-level semantics. Such queries are not only difficult to write but also challenging for a database engine to optimize, and thus can result in costly evaluation. This paper proposes to augment SQL with set predicates to bring out otherwise obscured set-level semantics. We study two approaches to processing set predicates: an aggregate function-based approach and a bitmap index-based approach. Moreover, we design a histogram-based probabilistic method of set predicate selectivity estimation for optimizing queries with multiple predicates. Experiments verify its accuracy and effectiveness in optimizing queries.
Introduction
As data warehousing and OLAP applications become more sophisticated, there is a growing demand for querying data with the semantics of set-level comparisons. For instance, a company may search its resume database for job candidates with a set of mandatory skills. Here, the skills of each candidate, as a set of values, are compared against the mandatory skills. Such sets are often dynamically formed. For example, suppose a table ResumeSkills(id, skill) connects skills to job candidates. A GROUP BY clause dynamically groups the tuples in it by id, with the values of attribute skill in each group forming a set. The problem is that the current GROUP BY clause can only do scalar value comparison through an accompanying HAVING clause. For instance, the aggregate functions SUM/COUNT/AVG/MAX produce a single numeric value, which is compared to a literal or another single aggregate value. Observing the demand for complex and dynamic set-level comparisons in databases, we propose the concept of a set predicate.
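For a concrete picture, the query below selects candidates whose skill set contains both 'java' and 'web services'; the SET ... CONTAIN syntax is a sketch of the kind of extension this paper has in mind, shown for illustration only:

    SELECT id
    FROM ResumeSkills
    GROUP BY id
    HAVING SET(skill) CONTAIN {'java', 'web services'}

Without such a predicate, the same semantics needs a scalar-level workaround in standard SQL, for example one subquery per required skill:

    SELECT id FROM ResumeSkills WHERE skill = 'java'
    INTERSECT
    SELECT id FROM ResumeSkills WHERE skill = 'web services';

Each additional required skill adds another subquery, which is precisely the formulation and optimization burden discussed next.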
The semantics of set-level comparisons can in many cases be expressed using current SQL syntax, without the proposed extension. However, the resulting queries would be more complex than necessary. One consequence is that such complex queries are difficult for users to formulate. More importantly, they are difficult for a DBMS to optimize, leading to unnecessarily costly evaluation: the resulting query plans can involve multiple subqueries with grouping and set operations. In contrast, the proposed concise syntax of set predicates enables direct expression of set-level comparisons in SQL, which not only makes query formulation simple but also facilitates efficient support of such queries.

We develop two approaches to processing set predicates. Aggregate function-based approach: this approach processes set predicates in a way similar to processing conventional aggregate functions. Given a query with set predicates, instead of decomposing the query into multiple subqueries, this approach needs only one pass over the table. Bitmap index-based approach: this approach processes set predicates by using bitmap indices on individual attributes. It is efficient because it can focus on only the tuples from those groups that satisfy the query conditions and only the bitmaps for the relevant columns. State-of-the-art bitmap compression methods and encoding strategies have made it affordable to build bitmap indices on many attributes, and this index structure is applicable to many different types of attributes. The bitmap index-based approach processes general queries (with joins, selections, multi-attribute grouping, and multiple set predicates) by utilizing single-attribute indices; hence, it does not require precomputed indices for join results or combinations of attributes.

We further develop an optimization strategy to handle queries with multiple set predicates connected by logical operations (AND, OR, NOT). A useful optimization rule is to prune unnecessary set predicates during query evaluation. Given a query with n conjunctive set predicates, the predicates can be evaluated sequentially; if no group qualifies after m predicates are processed, query evaluation can terminate without processing the remaining predicates. The number of "necessary" predicates before evaluation can stop, m, depends on the evaluation order of the predicates. We design a method to select a good order of predicate evaluation, i.e., an order that results in a small m and thus a cheap evaluation cost. The idea is to evaluate conjunctive (disjunctive) predicates in ascending (descending) order of their selectivity, where the selectivity of a predicate is its number of qualifying groups. We design a probabilistic approach to estimating set predicate selectivity using database histograms.
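As a minimal sketch of this pruning rule (the interfaces and names below are illustrative assumptions, not the paper's implementation), conjunctive predicates can be sorted by estimated selectivity and evaluated with early termination:

    import java.util.*;

    // Sketch: short-circuit evaluation of conjunctive set predicates,
    // ordered by estimated selectivity so evaluation can stop early (small m).
    interface SetPredicate {
        double estimatedQualifyingGroups();           // e.g., from histogram-based estimation
        Set<Integer> evaluate(Set<Integer> groups);   // ids of groups that qualify
    }

    final class ConjunctionEvaluator {
        static Set<Integer> evaluate(List<SetPredicate> preds, Set<Integer> allGroups) {
            // Ascending selectivity: the most restrictive predicate is evaluated first.
            Collections.sort(preds, new Comparator<SetPredicate>() {
                public int compare(SetPredicate a, SetPredicate b) {
                    return Double.compare(a.estimatedQualifyingGroups(),
                                          b.estimatedQualifyingGroups());
                }
            });
            Set<Integer> surviving = allGroups;
            for (SetPredicate p : preds) {
                surviving = p.evaluate(surviving);
                if (surviving.isEmpty()) break;       // prune the remaining predicates
            }
            return surviving;
        }
    }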
In summary, we propose to extend SQL with set predicates for an important class of analytical queries that would otherwise be difficult to write and optimize. We design two query evaluation approaches for set predicates: an aggregate function-based approach and a bitmap index-based approach. We develop a histogram-based probabilistic method to estimate the selectivity of a set predicate, for optimizing queries with multiple predicates. We conducted extensive experiments to evaluate the proposed approaches on both real and synthetic data.
Literature Review
Evaluation of Main Memory Join Algorithms for Joins with Set Comparison Join Predicates
Current data models like the NF2 model and object-oriented models support set-valued attributes. Hence, it becomes possible to have join predicates based on set comparison. This paper introduces and evaluates several main memory algorithms for efficiently evaluating this kind of join. More specifically, it concentrates on the set equality and subset predicates.

Since the invention of relational database systems, tremendous effort has been undertaken to develop efficient join algorithms. Starting from a simple nested-loop join algorithm, the first improvement was the introduction of the merge join. Later, the hash join and its improvements became alternatives to the merge join. (Overviews, and comparisons between sort-merge and hash joins, can be found in the literature.) A lot of effort has also been spent on parallelizing join algorithms based on sorting and hashing. Another important research area is the development of index structures that accelerate the evaluation of joins. All of these algorithms concentrate on simple join predicates based on the comparison of two atomic values. Predominant is the work on equijoins, i.e., where the join predicate is based on the equality of atomic values. Only a few articles deal with special non-equijoins in conjunction with aggregate functions, and with pointer-based joins. An area where more complex join predicates occur is that of spatial database systems; here, special algorithms to support spatial joins have been developed. Despite this large body of work on efficient join processing, the authors are not aware of any work describing join algorithms for the efficient computation of the join if the join predicate is based on set comparisons like set equality (=) or subset-or-equal (⊆). These joins were irrelevant in the relational context, since attribute values had to be atomic. However, newer data models like NF2 or object-oriented models like the ODMG model support set-valued attributes, and many interesting queries require a join based on set comparison. Consider, for example, the query for faithful couples: there, persons are joined with persons on the condition that their children attributes are equal. Another example query is that of job matching: there, job owners are joined with persons such that the set-valued attribute required-skills is a subset of the persons' set-valued attribute skills. Plenty more queries involving joins based on set comparisons could be given, but these suffice for motivation.

Assume the existence of two relations R1 and R2 with set-valued join attributes a and b. The exact type of the attributes a and b does not matter, that is, whether each is, e.g., a relation, a set of strings, or a set of object identifiers. We just assume that they are sets and that their elements provide an equality predicate. The goal of the paper is to compute the join expressions efficiently. More specifically, it introduces join algorithms based on sorting and hashing and compares their performance with a simple nested-loop strategy. In addition, it describes a tree-based join algorithm and evaluates its performance. For convenience, assume there exists a function m which maps each element within the sets of R1.a and R2.b to the domain of integers. The function m depends on the type of the elements of the set-valued attributes: for integers, it is the identity; for strings and other types, techniques like folding can be used. From now on, we assume without loss of generality that the element type of the sets is integer; if this is not the case, the function m has to be applied before anything else is done with the set elements.
The cost of comparing two sets via = or ⊆ differs significantly depending on the algorithm used, and can easily vary by a factor of two. For small sets, a simple strategy might be reasonable; for large sets, however, its comparison cost can be significant. Hence, further alternatives for set comparison are considered. One obvious alternative for efficiently evaluating s ⊆ t is to have a search tree or hash table representation of t. Since we assume that the representation of set-valued attributes is not of this kind, but instead consists of a list or an array of elements, this solution was abandoned: the memory consumption and CPU time needed to construct such indexed representations are too expensive in comparison to the methods that follow. Another alternative for implementing set comparison is based on sorting the elements. Assuming an array representation of the elements of a set, and denoting the i-th element of a set s by s[i], the set comparison s = t can be implemented by a single linear pass if the sets are sorted, as sketched in the code below. Now consider the case in which we want to evaluate s ⊆ t for two sets s and t. We could check whether each element of s occurs in t; if t is implemented as an array or list, this algorithm takes O(|s|·|t|) time. Set equality can then be implemented by testing s ⊆ t and t ⊆ s.
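A Java sketch of the two comparison strategies just described (illustrative only; sets are integer arrays, as produced by the mapping function m):

    // Set comparison on array-represented sets.
    final class SetCompare {
        // s = t on sorted arrays: a single linear pass, O(|s| + |t|).
        static boolean equalsSorted(int[] s, int[] t) {
            if (s.length != t.length) return false;
            for (int i = 0; i < s.length; i++) {
                if (s[i] != t[i]) return false;
            }
            return true;
        }

        // Naive s ⊆ t on unsorted arrays: O(|s| * |t|) element probes.
        // Set equality can also be obtained as subsetNaive(s, t) && subsetNaive(t, s).
        static boolean subsetNaive(int[] s, int[] t) {
            for (int e : s) {
                boolean found = false;
                for (int f : t) {
                    if (e == f) { found = true; break; }
                }
                if (!found) return false;
            }
            return true;
        }
    }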
Efficient Processing of Joins on Set-Valued Attributes
Object-oriented and object-relational DBMSs support set-valued attributes, which are a natural and concise way to model complex information. However, there has been limited research to date on the evaluation of query operators that apply to sets. This paper studies the join of two relations on their set-valued attributes. Various join types are considered, namely the set containment, set equality, and set overlap joins. It shows that the inverted file, a powerful index for selection queries, can also facilitate the efficient evaluation of most join predicates, and it proposes join algorithms that utilize inverted files and compares them with signature-based methods for several set-comparison predicates.

Although sets are ubiquitous in many applications (document retrieval, semi-structured data management, data mining, etc.), there has been limited research on the evaluation of database operators that apply to sets. On the other hand, there has been significant research in the IR community on the management of set-valued data, triggered by the need for fast content-based retrieval in document databases. Research in this area has focused on the processing of keyword-based selection queries. A range of indexing methods has been proposed, among which signature-based techniques and inverted files dominate. These indexes have been extensively revised and evaluated for various selection queries on set-valued attributes. An important operator which has received limited attention is the join between two relations on their set-valued attributes. Formally, given two relations R and S with set-valued attributes R:r and S:s, the join returns the subset of their Cartesian product R × S in which the set-valued attributes qualify the join predicate. Two join operators that have been studied to date are the set containment join, where the predicate is ⊆, and the all nearest neighbors operator, which retrieves for each set in R its k nearest neighbors in S based on a similarity function. Other operators include the set equality join and the set overlap join. As an example of a set containment join, consider the join of a job-offers relation R with a persons relation S such that R:required_skills ⊆ S:skills, where R:required_skills stores the required skills for each job and S:skills captures the skills of persons. This query will return qualified person-job pairs. As an example of a set overlap join, consider two bookstore databases with similar book catalogs, which store information about their clients. We may want to find pairs of clients who have bought a large number of common books, in order to make recommendations or for classification purposes.

This paper studies the efficient processing of set join operators. It assumes that the DBMS stores the set-valued attribute in a nested representation (i.e., the set elements are stored together with the other attributes in a single tuple), a representation shown in prior work to facilitate efficient query evaluation. It proposes join algorithms that utilize inverted files and compares them with signature-based methods for several set-comparison predicates. In particular, it proposes a new algorithm for the set containment join, which is significantly faster than previous approaches: the algorithm builds an inverted file for the container relation S and uses it to process the join. The efficiency and robustness of this method are demonstrated under various experimental settings. The efficient processing of other join predicates, namely the set equality and set overlap joins, is also discussed and validated; for each problem, alternative methods based on both signatures and inverted files are considered. Experiments show that using an inverted file to perform the join is also the most appropriate choice for overlap joins, whereas signature-based methods are competitive only for equality joins.
The inverted file is an alternative index for set-valued attributes. For each set element el in the domain D, an inverted list is created with the record ids (rids) of the sets that contain this element, in sorted order. These lists are stored sequentially on disk, and a directory is created on top of the inverted file, pointing for each element to the offset of its inverted list in the file. If the domain cardinality |D| is large, the directory may not fit in memory, so its ⟨el, offset⟩ entries are organized in a B+-tree with el as the key. Given a query set q, assume that we want to find all sets that contain it. Searching the inverted file is performed by fetching the inverted lists for all elements in q and merge-joining them; the final list contains the record ids of the results. For example, if the set containment query q = {el2, el3} is applied on the inverted file, only the tuple with rid = 132 qualifies. Since the record ids are sorted in the lists, the join is performed fast. If the search predicate is 'overlap', the lists are just merged, eliminating duplicates. Notice that the inverted file can also process threshold-overlap queries efficiently, since we just need to count the number of occurrences of each record id in the merged list. Fetching the inverted lists can incur many random accesses; to reduce the access cost, the elements of q are first sorted and the lists are retrieved in this order. Prefetching can be used to avoid random accesses if two lists are close to each other.

Updates on the inverted file are expensive. When a new set is inserted, a number of lists equal to its cardinality have to be updated. An update-friendly version of the index allows some space at the end of each list for potential insertions and/or distributes the lists to buckets, so that new disk pages can be allocated at the end of each bucket. The inverted file shares some features with the bit-sliced signature file: both structures merge a set of rid-lists (represented in different ways) and are mostly suitable for static data with batch updates. In recent comparison studies, the inverted file was proven significantly faster than signature-based indexes for set containment queries. There are several reasons for this. First, it is especially suitable for real-life applications where the sets are sparse and the domain cardinality is large; in this case, signature-based methods generate many false drops. Second, it applies exact indexing, as opposed to signature-based methods, which retrieve a superset of the qualifying rids and require a refinement step. Finally, as discussed next, it achieves higher compression rates than signature-based approaches, and its performance is less sensitive to the decompression overhead.

Both signature-based indexes and inverted files can use compression to reduce storage and retrieval costs. Dense signatures are already compressed representations of sets, but sparse ones could require more storage than the sets themselves (e.g., the exact signature representations of small sets). Thus, one way to compress the signatures is to reduce their length b at the cost of increasing the false drop probability. Another method is to store the element ids instead of the signature for small sets, or to encode the gaps between the 1's instead of storing the actual bitmap. However, this technique increases the cost of query processing: the signature has to be constructed on-the-fly if we still want to use fast bitwise operations. The bit-sliced index can also be compressed by encoding the gaps between the 1's, but again this comes at the expense of extra query processing cost.
The inverted file for a relation R can be constructed quickly using sorting and hashing techniques. A simple method is to read each tuple tR ∈ R and decompose it into a number of binary ⟨el, tR:rid⟩ tuples, one for each element el that appears in the set tR:r. These tuples are then sorted on el and the index is built. For small relations this is fast enough; for larger relations, however, we have to consider techniques that save I/O and computational cost. A better method is to partition the binary tuples using a hash function on el and build a part of the index for each partition individually. The indexes are then concatenated, and the same hash function is used to define an order for the elements. In order to minimize the size of the partitions, before the partial lists of the partitions are written to disk, they can be sorted on tR:rid and compressed. This comes at the additional cost of decompressing them and sorting the tuples on el to create the inverted lists for the partition. An alternative is to sort the partial lists on el first, in order to save computation when sorting the whole partition in memory. The cost of building the inverted file comprises the cost of reading the relation, the cost of writing and reading the partitions, and the cost of writing the final inverted file. If compression is used, the overall cost is not much larger than the cost of reading R, because the compressed partial information is several times smaller than the actual data.

A simple indexed nested-loops join method: the simplest and most intuitive way to perform the set containment join using inverted files is to build an inverted file SIF for S and apply a set containment query for each tuple in R. It is not hard to see that if SIF fits in memory, this method has optimal I/O cost. However, for large problems SIF may not fit in memory, and in the worst case it has to be read once for each tuple in R. Therefore, alternative techniques that apply in the general case should be considered.
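A compact, memory-resident sketch of this machinery (data layout and names are assumptions, not the paper's code): each element maps to a sorted rid list, a containment query merge-joins the lists of the query's elements, and the simple indexed nested-loops join probes the structure once per tuple of R:

    import java.util.*;

    // Sketch of a memory-resident inverted file for set containment queries.
    // invertedLists maps each element id to the sorted rids of the sets containing it.
    final class InvertedFile {
        private final Map<Integer, int[]> invertedLists;

        InvertedFile(Map<Integer, int[]> invertedLists) { this.invertedLists = invertedLists; }

        // rids of all indexed sets that contain every element of q.
        int[] containmentQuery(int[] q) {
            int[] result = null;
            for (int el : q) {
                int[] list = invertedLists.get(el);
                if (list == null) return new int[0];   // some element occurs nowhere
                result = (result == null) ? list : intersectSorted(result, list);
                if (result.length == 0) break;
            }
            return (result == null) ? new int[0] : result;
        }

        // Indexed nested-loops containment join: one probe per tuple of R.
        static List<int[]> join(List<int[]> r, InvertedFile sif) {
            List<int[]> pairs = new ArrayList<int[]>();   // (index in R, rid in S) pairs
            for (int i = 0; i < r.size(); i++) {
                for (int rid : sif.containmentQuery(r.get(i))) pairs.add(new int[]{i, rid});
            }
            return pairs;
        }

        // Merge-join of two sorted rid lists.
        private static int[] intersectSorted(int[] a, int[] b) {
            int[] out = new int[Math.min(a.length, b.length)];
            int i = 0, j = 0, k = 0;
            while (i < a.length && j < b.length) {
                if (a[i] < b[j]) i++;
                else if (a[i] > b[j]) j++;
                else { out[k++] = a[i]; i++; j++; }
            }
            return Arrays.copyOf(out, k);
        }
    }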
Adaptive Algorithms for Set Containment Joins
A set containment join is a join between set-valued attributes of two relations whose join condition is specified using the subset (⊆) operator. Set containment joins are deployed in many database applications, even those that do not support set-valued attributes. This paper proposes two novel partitioning algorithms, called the Adaptive Pick-and-Sweep Join (APSJ) and the Adaptive Divide-and-Conquer Join (ADCJ), which allow computing set containment joins efficiently. It shows that APSJ outperforms previously suggested algorithms for many data sets, often by an order of magnitude, and presents a detailed analysis of the algorithms along with a study of their performance on real and synthetic data using an implemented testbed.

Set containment queries are utilized in many database applications, especially when the underlying database systems support set-valued attributes. For example, consider a database application used by a human-resource broker company to match the skills of job seekers with the skills required by employers. Imagine that the set of skills needed to fill an open position is stored in the set-valued attribute {reqSkills} of table Jobs(jobID, {reqSkills}). Another table, Jobseekers(personID, {availSkills}), keeps the set of skills of each potential recruit. Then a match between the qualifying job seekers and the jobs can be computed using the set containment query

    SELECT Jobseekers.personID, Jobs.jobID
    FROM Jobs, Jobseekers
    WHERE Jobs.{reqSkills} ⊆ Jobseekers.{availSkills}

In this query, the tables are joined on their set-valued attributes using the subset operator ⊆ as the join condition; this kind of join is called a set containment join. Set containment joins are used in a variety of other scenarios. If, for instance, the first relation contained sets of parts used in construction projects, and the second one contained sets of parts offered by each equipment vendor, we could determine which construction projects can be supplied by a single vendor using a set containment join. Or, consider a database application that recommends to students a list of courses that they are eligible to take: such a recommendation can be computed using a set containment join on prerequisite courses and courses already taken by students. Notice that containment queries can be utilized even in database systems that support only atomic attribute values. Consider a relational database of a stock broker company. Imagine that the investment portfolios of the customers are kept in a relational table Portfolios(portfolioID, stockID, numPositions), while the information about the composition of the mutual funds is kept in a table Funds(fundID, stockID, percentage). Assume that the stock broker wants to find the portfolios that mirror mutual funds, i.e., those that contain just a portion of the stocks from a certain mutual fund, and no other stocks.
The query

    SELECT P.portfolioID, F.fundID
    FROM Portfolios P, Funds F
    WHERE P.stockID = F.stockID
    GROUP BY P.portfolioID, F.fundID
    HAVING COUNT(*) = (SELECT COUNT(*) FROM Portfolios P2
                       WHERE P2.portfolioID = P.portfolioID)

joins Portfolios and Funds on the stockID attribute and returns those portfolios whose stock positions are entirely contained in a fund, together with the fundID. To see how this query works, notice that the number of joining tuples for a given P.portfolioID and F.fundID must be equal to the total number of stocks in portfolio P.portfolioID. The above query corresponds to a set containment query, although expressed in a less obvious way. Additional types of applications for containment joins arise when text or XML documents are viewed as sets of words or XML elements, or when flat relations are folded into a nested representation.

The two best-known algorithms for computing set containment joins efficiently are the Partitioning Set Join (PSJ) and the Divide-and-Conquer Join (DCJ). PSJ and DCJ introduce crucial performance gains compared with straightforward approaches. A major limitation of PSJ is that it quickly becomes ineffective as set cardinalities grow. In contrast, DCJ depends only on the ratio of set cardinalities in the two relations and therefore wins over PSJ when the sets are large. Often, the sets involved in the join computation are indeed quite large. For instance, biochemical databases contain sets with many thousands of elements each. In fact, the fruit fly (drosophila) has around 14,000 genes, 70-80% of which are active at any time; a snapshot of active genes can thus be represented as a set of around 10,000 elements. PSJ is ineffective for such data sets.

To restate the definition: a set containment join is a join between set-valued attributes of two relations whose join condition is specified using the subset (⊆) operator. Consider two sample relations R and S, each containing one column with sets of integers: three sets in R and four in S. For easy reference, the sets of R and S are labeled using letters a, b, c and A, B, C, D, respectively. Computing the containment join R⊆S amounts to finding all tuples (r, s) ∈ R×S such that r ⊆ s; in this example, R⊆S = {(a, A), (b, B), (c, C)}. Obviously, we can always compute R⊆S in a straightforward way by testing each tuple in the cross-product R×S for the subset condition. In this example, such an approach would require |R|·|S| = 3·4 = 12 set comparisons. For large relations R and S, doing |R|·|S| comparisons becomes very time consuming: the set comparisons are expensive, since each one requires traversing and comparing a substantial portion of the elements of both sets. Moreover, when the relations do not fit into memory, enumerating the cross-product incurs a substantial I/O cost.
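For concreteness, the straightforward cross-product baseline that PSJ, DCJ, and the adaptive algorithms improve upon can be sketched as follows (illustrative only; sets are modeled as integer arrays):

    import java.util.*;

    // Brute-force containment join: |R|*|S| subset tests over the cross-product.
    final class NaiveContainmentJoin {
        static List<int[]> join(List<int[]> r, List<int[]> s) {
            List<int[]> pairs = new ArrayList<int[]>();   // (index in R, index in S)
            for (int j = 0; j < s.size(); j++) {
                // Hash the container set once so each subset test is O(|r_i|).
                Set<Integer> container = new HashSet<Integer>();
                for (int e : s.get(j)) container.add(e);
                for (int i = 0; i < r.size(); i++) {
                    boolean subset = true;
                    for (int e : r.get(i)) {
                        if (!container.contains(e)) { subset = false; break; }
                    }
                    if (subset) pairs.add(new int[]{i, j});
                }
            }
            return pairs;
        }
    }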
Bit-Sliced Index Arithmetic
The bit-sliced index (BSI) paper introduces the concept of BSI arithmetic. For any two BSIs X and Y on a table T, it shows how to efficiently generate new BSIs Z, V, and W such that Z = X + Y, V = X - Y, and W = MIN(X, Y); this means that if a row r in T has a value x represented in BSI X and a value y in BSI Y, the value for r in BSI Z will be x + y, the value in V will be x - y, and the value in W will be MIN(x, y). Since a bitmap representing a set of rows is the simplest bit-sliced index, BSI arithmetic is the most straightforward way to determine multisets of rows (with duplicates) resulting from the SQL clauses UNION ALL (addition), EXCEPT ALL (subtraction), and INTERSECT ALL (min). Another contribution of the paper is to generalize BSI range restrictions to a new non-Boolean form: determining the top k BSI-valued rows, for any meaningful value of k between one and the total number of rows in T. Together with bit-sliced addition, this permits solving a common basic problem of text retrieval: given an object-relational table T of rows representing documents, with a collection-type column K representing keyword terms, the paper demonstrates an efficient algorithm to find the k documents that share the largest number of terms with some query list Q of terms. A great deal of published work on such problems exists in the Information Retrieval (IR) field. The algorithm introduced, called Bit-Sliced Term-Matching (BSTM), uses an approach comparable in performance to the most efficient known IR algorithm and is a major improvement on current DBMS text searching algorithms, with the advantage that it uses only indexing proposed for native database operations.
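The addition Z = X + Y can be realized with full-adder logic applied one slice at a time, 64 rows per machine word; the sketch below illustrates the idea (the representation and names are assumptions, not the paper's code):

    // Sketch of bit-sliced addition Z = X + Y (full-adder logic per slice).
    // A BSI is modeled as slice[i]: a verbatim bitmap (long[] of `words` machine
    // words) holding bit i of every row's value.
    final class BsiArithmetic {
        static long[][] add(long[][] x, long[][] y, int words) {
            int slices = Math.max(x.length, y.length);
            long[][] z = new long[slices + 1][words];  // extra slice for the final carry
            long[] carry = new long[words];
            for (int i = 0; i < slices; i++) {
                long[] xi = (i < x.length) ? x[i] : new long[words];
                long[] yi = (i < y.length) ? y[i] : new long[words];
                for (int w = 0; w < words; w++) {
                    long a = xi[w], b = yi[w], c = carry[w];
                    z[i][w] = a ^ b ^ c;                      // sum bit for 64 rows at once
                    carry[w] = (a & b) | (a & c) | (b & c);   // carry-out = bitwise majority
                }
            }
            z[slices] = carry;  // left-over carries become the top slice of Z
            return z;
        }
    }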
Bitmap index definition: to create a bitmap index, all N rows of the underlying table T must be assigned ordinal numbers 1, 2, ..., N, called ordinal row-positions, or simply ordinal positions. Then, for any index value Xi of an index X on T, the list of rows in T that have the value Xi can be represented by an ordinal-position list such as 4, 7, 11, 15, 17, ..., or equivalently by a verbatim bitmap, 00010010001000101.... Note that sparse verbatim bitmaps (having a small number of 1's relative to 0's) are compressed, to save disk and memory space. Variations on this bitmap index definition were studied in [CHI98, CHI99, WUB98, WU99]. Ordinal row-positions 1, ..., N can be assigned to table pages in fixed-size blocks of size J: 1 through J on the first page, J+1 through 2J on the second page, etc., where J is the maximum number of rows of T that will fit on a page (i.e., the maximum occurs for the shortest rows). This makes it possible to determine the zero-based page number pn for a row with ordinal position n by the formula pn = ⌊(n-1)/J⌋. A known page number can then be accessed very quickly when long extents of the table are mapped to contiguous physical disk addresses. Since variable-length rows might lead to fewer rows on a page, some pages might have no rows for the larger ordinal numbers assigned; for this reason, an existence bitmap (EBM) is maintained for the table, containing 1 bits in the ordinal positions where rows exist and 0 bits otherwise. The EBM can also be useful if rows are deleted, making it possible to defer index updates. It is a common misunderstanding that every row-list in a bitmap index must be carried in verbatim bitmap form. In reality, some form of compression is always used for sparse bitmaps (although verbatim bitmaps are preferred down to a relatively sparse ratio of 1's to 0's, such as 1/50, because many operations on verbatim bitmaps are more CPU-efficient than on compressed forms). In the architecture implemented for the paper, called the BITSLICE architecture, bitmap compression simply involves converting sparse bitmap pages into ordered lists of segment-relative ordinal positions called ORDRIDs; segmentation was introduced in [ON87, ONQ97] and is used in the BITSLICE architecture.
Operations on bitmaps: pseudo-code for the logical operations AND, OR, NOT, and COUNT on bitmaps was provided in earlier work, so only short descriptions are given here. Given two verbatim bitmaps B1 and B2, we can create the bitmap B3 = B1 AND B2 by treating memory-resident segment fragments of these bitmaps as arrays of long ints in C, and looping through the fragments, setting B3[I] = B1[I] & B2[I]. The logic can stream through successive segment fragments from disk (for B1 and B2) and to disk (for B3) until the operation is complete. The bitmap B3 = B1 OR B2 is computed in the same way, and B3 = NOT B1 is computed by setting B3[I] = ~B1[I] & EBM[I] in the loop. Note that the efficiency of bitmap operations arises from a type of parallelism in Boolean operations in CPU registers, specifically SIMD (Single-Instruction-Multiple-Data), where many bits (32, or 64 on some machines) are dealt with in a single AND, OR, or NOT operation occurring in the simplest possible loop. To find the number of rows represented in a bitmap B1, COUNT(B1), another SIMD trick is used: the bitmap fragment to be counted is overlaid with a short int array, and the loop through the fragment uses the short ints as indexes into another array containing the number of 1 bits in each short int, aggregating these into a count variable. Logical AND and OR on two segment ORDRID-lists B1 and B2 are performed by looping through the two lists to perform a merge-intersect or merge-union into an ORDRID-list B3; in the case of OR, the resulting ORDRID-list might grow large enough to require conversion to a verbatim bitmap, an easy case to recognize, and easily done by initializing a zero bitmap for the segment and turning on the bits found in the union. The NOT operation on a segment ORDRID-list B1 is performed by copying the EBM segment and turning off the bits corresponding to ORDRIDs found in B1. To perform AND and OR with a verbatim bitmap B1 in one index segment and an ORDRID-list B2 in another, the ORDRID-list is assumed to have fewer elements and efficiently drives the loop to access individual bits in the bitmap and perform the Boolean test: in the case of AND, filling in a new ORDRID-list B3, and in the case of OR, initializing the verbatim bitmap B3 to B1 and turning on the bits from the ORDRID-list B2.
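A minimal Java rendering of these verbatim-bitmap loops (illustrative; Long.bitCount stands in for the short-int lookup table described above):

    // Word-wise operations on verbatim bitmaps stored as long[] fragments.
    final class BitmapOps {
        static long[] and(long[] b1, long[] b2) {
            long[] b3 = new long[b1.length];
            for (int i = 0; i < b1.length; i++) b3[i] = b1[i] & b2[i];
            return b3;
        }

        // NOT is masked by the existence bitmap (EBM) so absent ordinal
        // positions never turn into spurious 1 bits.
        static long[] not(long[] b1, long[] ebm) {
            long[] b3 = new long[b1.length];
            for (int i = 0; i < b1.length; i++) b3[i] = ~b1[i] & ebm[i];
            return b3;
        }

        static long count(long[] b1) {
            long n = 0;
            for (long word : b1) n += Long.bitCount(word);  // population count per word
            return n;
        }
    }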
Existing System

• State-of-the-art bitmap compression methods and encoding strategies have made it affordable to build bitmap indices on many attributes.
• This index structure is also applicable to many different types of attributes. The bitmap index-based approach processes general queries (with joins, selections, multi-attribute grouping, and multiple set predicates) by utilizing single-attribute indices.
• Scalar-level predicates in SQL become increasingly inadequate to support a class of operations that require set-level comparison semantics, i.e., comparing a group of tuples with multiple values.
• Currently, complex SQL queries composed of scalar-level operations are often formed to obtain even very simple set-level semantics.
• Such queries are not only difficult to write but also challenging for a database engine to optimize, and can thus result in costly evaluation.
• Predicates are used in the search condition of WHERE clauses and HAVING clauses, the join conditions of FROM clauses, and other constructs where a Boolean value is required.
• One way to create a table, either on disk or in main memory, is to use a hash table. The goal of a hash table is Θ(1) lookup time, that is, constant lookup time.
• A mask is data that is used for bitwise operations, particularly in a bit field. Such sets are often dynamically formed.
• Disadvantages: such complex queries are difficult for users to formulate. A more severe consequence is that the set-level semantics becomes obscure; hence, a DBMS may choose unnecessarily costly evaluation plans for such queries.
Proposed System

• This paper proposes to augment SQL with set predicates, to bring out otherwise obscured set-level semantics.
• It defines two approaches to processing set predicates:
1) an aggregate function-based approach
2) a bitmap index-based approach
• Moreover, it designs a histogram-based probabilistic method of set predicate selectivity estimation, for optimizing queries with multiple predicates.
• Experiments verify its accuracy and effectiveness in optimizing queries. This paper focuses on the relational data model and architecture.
• The paper proposes to extend SQL with set predicates to support set-level comparisons. Such predicates, combined with grouping, allow selection of dynamically formed groups by comparison between a group and a set of values.
• The GROUP BY clause must follow the conditions in the WHERE clause and must precede the ORDER BY clause if one is used.
• The HAVING clause must follow the GROUP BY clause in a query and must also precede the ORDER BY clause if used.
• Predicates are used in the search condition of WHERE clauses and HAVING clauses, the join conditions of FROM clauses, and other constructs where a Boolean value is required.
• Advantages: SQL is extended with set predicates to support set-level comparisons, allowing a group of tuples to be compared against multiple values.
Hardware Requirements

• Processor Type: Pentium IV
• Speed: 2.4 GHz
• RAM: 256 MB
• Hard Disk: 20 GB
• Keyboard: 101/102 Standard Keys
• Mouse: Scroll Mouse
Software Requirements

• Operating System: Windows 7
• Programming Package: NetBeans IDE 7.3.1
• Coding Language: JDK 1.7
System Architecture
[Architecture diagram: the End User and the Admin each register and log in. The Admin uploads the dataset to the database; the End User submits search queries to the database and gets results.]
Modules

• Register & Login Process
• Upload Dataset
• Search Query with Set Predicates
Register & Login Process

• This module serves both the admin and the user. The admin creates the dataset and uploads it to the database.
• The user then accesses the database through keyword search.
• Before logging in, both the admin and the user must first register. The registration form asks for username, age, user ID, password, sex, and so on.
• A login refers to the credentials required to obtain access to a database or other restricted area.
• Logging in is the process by which an individual user's access to a computer system is controlled, by identifying and authenticating the user through the credentials the user presents.
• Both the admin and the user can access the database using their user ID and password.
Upload Dataset

• In this module, the admin can create or choose the dataset and upload it to the database.
• These datasets are then used for relational keyword search by users.
• The admin first creates or chooses the dataset, then extracts it and uploads it to the database.
• The user then submits a search query and gets relevant results.
Search Query with Set Predicates
• The end user submits a SQL query with set predicates, to bring out otherwise obscured set-level semantics.
• Two approaches are used to process set predicates: 1) an aggregate function-based approach and 2) a bitmap index-based approach.
• Aggregate function-based approach: this approach processes set predicates in a way similar to processing conventional aggregate functions. Given a query with set predicates, instead of decomposing the query into multiple subqueries, this approach needs only one pass over the table.
• Bitmap index-based approach: this approach processes set predicates by using bitmap indices on individual attributes.
• Bitmap index: a bitmap index is a special kind of index that stores the bulk of its data as bit arrays (bitmaps) and answers most queries by performing bitwise logical operations on these bitmaps.
• Hash table: one way to create a table, either on disk or in main memory, is to use a hash table. The goal of a hash table is Θ(1) lookup time, that is, constant lookup time.
• Finally, the relevant results are returned to the user.
Data Flow Diagram
[Data flow diagram: the user registers and logs in; the dataset is uploaded to the database; search queries flow from the user to the database, and relevant results flow back to the user.]
References

• S. Helmer and G. Moerkotte, "Evaluation of Main Memory Join Algorithms for Joins with Set Comparison Join Predicates," Proc. Int'l Conf. Very Large Data Bases (VLDB), 1996.
• N. Mamoulis, "Efficient Processing of Joins on Set-Valued Attributes," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 157-168, 2003.
• S. Melnik and H. Garcia-Molina, "Adaptive Algorithms for Set Containment Joins," ACM Trans. Database Systems, vol. 28, no. 1, pp. 56-99, 2003.
• D. Rinfret, P. O'Neil, and E. O'Neil, "Bit-Sliced Index Arithmetic," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 47-57, 2001.