Set Predicates in SQL: Enabling Set-Level Comparisons for Dynamically Formed Groups

Abstract

In data warehousing and OLAP applications, scalar-level predicates in SQL become increasingly inadequate to support a class of operations that require set-level comparison semantics, i.e., comparing a group of tuples with multiple values. Currently, complex SQL queries composed of scalar-level operations are often formed to obtain even very simple set-level semantics. Such queries are not only difficult to write but also challenging for a database engine to optimize, and thus can result in costly evaluation. This paper proposes to augment SQL with set predicates, to bring out otherwise obscured set-level semantics. We studied two approaches to processing set predicates: an aggregate function-based approach and a bitmap index-based approach. Moreover, we designed a histogram-based probabilistic method of set predicate selectivity estimation, for optimizing queries with multiple predicates. The experiments verified its accuracy and effectiveness in optimizing queries.

Introduction

As data warehousing and OLAP applications become more sophisticated, there is a high demand for querying data with the semantics of set-level comparisons. For instance, a company may search its resume database for job candidates with a set of mandatory skills. Here, the skills of each candidate, as a set of values, are compared against the mandatory skills. Such sets are often dynamically formed. For example, suppose a table ResumeSkills(id, skill) connects skills to job candidates. A GROUP BY clause dynamically groups the tuples in it by id, with the values on attribute skill in each group forming a set. The problem is that the current GROUP BY clause can only do scalar value comparison through an accompanying HAVING clause. For instance, aggregate functions SUM/COUNT/AVG/MAX produce a single numeric value, which is compared to a literal or another single aggregate value.
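As an illustration of how set-level semantics must be simulated today, the following minimal SQLite sketch (table contents and the mandatory-skill set are hypothetical) checks that a candidate's skill set contains a set of mandatory skills using only grouping and counting:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ResumeSkills (id INTEGER, skill TEXT);
INSERT INTO ResumeSkills VALUES
  (1, 'java'), (1, 'sql'), (1, 'web'),
  (2, 'java'),
  (3, 'sql'), (3, 'web');
""")
# Mandatory skills: {'java', 'sql'}. Without a set predicate, the
# containment semantics must be simulated with a filter plus COUNT:
rows = conn.execute("""
    SELECT id
    FROM ResumeSkills
    WHERE skill IN ('java', 'sql')
    GROUP BY id
    HAVING COUNT(DISTINCT skill) = 2   -- size of the mandatory set
""").fetchall()
print(rows)  # only candidate 1 has both mandatory skills: [(1,)]
```

Note that the constant 2 must be kept in sync with the size of the mandatory set by hand, one symptom of the obscured set-level semantics.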
Observing the demand for complex and dynamic set-level comparisons in databases, we propose the concept of set predicates. The semantics of set-level comparisons can in many cases be expressed using current SQL syntax without the proposed extension. However, the resulting queries would be more complex than necessary. One consequence is that such complex queries are difficult for users to formulate. More importantly, they are difficult for a DBMS to optimize, leading to unnecessarily costly evaluation. The resulting query plans could involve multiple subqueries with grouping and set operations. On the contrary, the proposed concise syntax of set predicates enables direct expression of set-level comparisons in SQL, which not only makes query formulation simple but also facilitates efficient support of such queries. We developed two approaches to process set predicates:

Aggregate function-based approach: This approach processes set predicates in a way similar to processing conventional aggregate functions. Given a query with set predicates, instead of decomposing the query into multiple subqueries, this approach needs only one pass of table scan.

Bitmap index-based approach: This approach processes set predicates by using bitmap indices on individual attributes. It is efficient because it can focus on only the tuples from those groups that satisfy query conditions and only the bitmaps for relevant columns. State-of-the-art bitmap compression methods and encoding strategies have made it affordable to build bitmap indices on many attributes. This index structure is also applicable to many different types of attributes. The bitmap index-based approach processes general queries (with joins, selections, multi-attribute grouping, and multiple set predicates) by utilizing single-attribute indices. Hence, it does not require precomputed indices for join results or combinations of attributes.
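A minimal sketch of the aggregate function-based idea (function names and data are hypothetical): a single scan accumulates per-group state, much like an aggregate, after which a CONTAIN-style set predicate is checked per group:

```python
from collections import defaultdict

def eval_contain(tuples, required):
    """One-pass evaluation of a CONTAIN set predicate: for each group,
    record which of the required values have been seen so far."""
    seen = defaultdict(set)
    for gid, value in tuples:          # single table scan
        if value in required:
            seen[gid].add(value)
    # a group qualifies once it has accumulated every required value
    return sorted(g for g, s in seen.items() if s == required)

data = [(1, 'java'), (1, 'sql'), (2, 'java'), (3, 'sql'), (3, 'web')]
print(eval_contain(data, {'java', 'sql'}))  # [1]
```

The per-group state here is a plain set; the point is that no subqueries or repeated scans are needed, mirroring how conventional aggregates are evaluated.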
We further developed an optimization strategy to handle queries with multiple set predicates connected by logic operations (AND, OR, NOT). A useful optimization rule is to prune unnecessary set predicates during query evaluation. Given a query with n conjunctive set predicates, the predicates can be evaluated sequentially. If no group qualifies after m predicates are processed, query evaluation can terminate without processing the remaining predicates. The number of "necessary" predicates before evaluation can stop, m, depends on the evaluation order of the predicates. We designed a method to select a good order of predicate evaluation, i.e., an order that results in a small m and thus a low evaluation cost. Our idea is to evaluate conjunctive (disjunctive) predicates in the ascending (descending) order of their selectivity, where the selectivity of a predicate is its number of qualified groups. We designed a probabilistic approach to estimating set predicate selectivity using database histograms. We proposed to extend SQL with set predicates for an important class of analytical queries, which otherwise would be difficult to write and optimize. We designed two query evaluation approaches for set predicates: an aggregate function-based approach and a bitmap index-based approach. We developed a histogram-based probabilistic method to estimate the selectivity of a set predicate, for optimizing queries with multiple predicates. We conducted extensive experiments to evaluate the proposed approaches over both real and synthetic data.

Literature Review

Evaluation of Main Memory Join Algorithms for Joins with Set Comparison Join Predicates

Current data models like the NF2 model and object-oriented models support set-valued attributes. Hence, it becomes possible to have join predicates based on set comparison. This paper introduces and evaluates several main memory algorithms to efficiently evaluate this kind of join.
More specifically, we concentrate on the set equality and the subset predicates. Since the invention of relational database systems, tremendous effort has been undertaken to develop efficient join algorithms. Starting from a simple nested-loop join algorithm, the first improvement was the introduction of the merge join. Later, the hash join and its improvements became alternatives to the merge join. (See the literature for overviews and for a comparison between sort-merge and hash joins.) A lot of effort has also been spent on parallelizing join algorithms based on sorting and hashing. Another important research area is the development of index structures that accelerate the evaluation of joins. All of these algorithms concentrate on simple join predicates based on the comparison of two atomic values. Predominant is the work on equijoins, i.e., where the join predicate is based on the equality of atomic values. Only a few articles deal with special non-equijoins in conjunction with aggregate functions, and with pointer-based joins. An area where more complex join predicates occur is that of spatial database systems. Here, special algorithms to support spatial joins have been developed. Despite this large body of work on efficient join processing, the authors are not aware of any work describing join algorithms for the efficient computation of a join whose predicate is based on set comparisons like set equality (=) or subset-or-equal (⊆). These joins were irrelevant in the relational context since attribute values had to be atomic. However, newer data models like NF2 or object-oriented models like the ODMG model support set-valued attributes, and many interesting queries require a join based on set comparison. Consider for example the query for faithful couples. There, we join persons with persons on the equality of their children attributes. Another example query is that of job matching.
There, we join job owners with persons such that the set-valued attribute required-skills is a subset of the persons' set-valued attribute skills. We could give plenty more queries involving joins based on set comparisons, but we think these suffice for motivation. Assume the existence of two relations R1 and R2 with set-valued join attributes a and b. We do not care about the exact type of the attributes a and b, that is, whether each is e.g. a relation, a set of strings, or a set of object identifiers. We just assume that they are sets and that their elements provide an equality predicate. The goal of the paper is to compute the join expressions efficiently. More specifically, we introduce join algorithms based on sorting and hashing and compare their performance with a simple nested-loop strategy. In addition, we describe a tree-based join algorithm and evaluate its performance. For convenience, we assume that there exists a function m which maps each element within the sets of R1.a and R2.b to the domain of integers. The function m depends on the type of the elements of the set-valued attributes. For integers, the function is the identity; for strings and other types, techniques like folding can be used. From now on, we assume without loss of generality that the type of the sets is integer. If this is not the case, the function m has to be applied before we do anything else with the set elements. The costs of comparing two sets by = or ⊆ differ significantly depending on the algorithm used; implementing equality as two subset tests, for instance, can double the cost. For small sets, this might be a reasonable strategy. For large sets, however, the comparison cost with such a simple strategy can be significant. Hence, we consider further alternatives for set comparison. One obvious alternative to evaluate s ⊆ t efficiently is to have a search tree or a hash table representation of t.
Since we assume that the representation of set-valued attributes is not of this kind but instead consists of a list or an array of elements, we abandoned this solution: the memory consumption and CPU time needed to construct such indexed representations are too expensive in comparison to the methods that follow. Another alternative to implement set comparison is based on sorting the elements. Assuming an array representation of the elements of the set, and denoting the i-th element of a set s by s[i], a simple merge-style algorithm implements the comparison s = t if the sets are sorted. Consider the case in which we want to evaluate s ⊆ t for two sets s and t. We could check whether each element in s occurs in t. If t is implemented as an array or list, then this algorithm takes O(|s|·|t|) time. Set equality can then be implemented by testing s ⊆ t and t ⊆ s.

Efficient Processing of Joins on Set-valued Attributes

Object-oriented and object-relational DBMS support set-valued attributes, which are a natural and concise way to model complex information. However, there has been limited research to date on the evaluation of query operators that apply on sets. In this paper we study the join of two relations on their set-valued attributes. Various join types are considered, namely the set containment, set equality, and set overlap joins. We show that the inverted file, a powerful index for selection queries, can also facilitate the efficient evaluation of most join predicates. We propose join algorithms that utilize inverted files and compare them with signature-based methods for several set-comparison predicates. Although sets are ubiquitous in many applications (document retrieval, semi-structured data management, data mining, etc.), there has been limited research on the evaluation of database operators that apply on sets.
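The sorted-array comparison routines described above (s ⊆ t via a merge-style scan, equality via containment both ways) can be sketched as follows; this is a minimal sketch with hypothetical function names:

```python
def subset(s, t):
    """s ⊆ t for sorted, duplicate-free arrays via a merge-style scan:
    O(|s| + |t|), versus O(|s|*|t|) for testing each element of s
    against an unindexed t."""
    i = 0
    for x in s:
        while i < len(t) and t[i] < x:
            i += 1
        if i == len(t) or t[i] != x:
            return False
        i += 1
    return True

def set_equal(s, t):
    """Equality via containment; for sorted duplicate-free arrays a
    length check makes a single subset test sufficient."""
    return len(s) == len(t) and subset(s, t)

print(subset([2, 5], [1, 2, 3, 5]))   # True
print(set_equal([1, 3], [1, 2, 3]))   # False
```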
On the other hand, there has been significant research in the IR community on the management of set-valued data, triggered by the need for fast content-based retrieval in document databases. Research in this area has focused on the processing of keyword-based selection queries. A range of indexing methods has been proposed, among which signature-based techniques and inverted files dominate. These indexes have been extensively revised and evaluated for various selection queries on set-valued attributes. An important operator which has received limited attention is the join between two relations on their set-valued attributes. Formally, given two relations R and S, with set-valued attributes R.r and S.s, the set join of R and S returns the subset of their Cartesian product R × S in which the set-valued attributes qualify the join predicate. Two join operators that have been studied to date are the set containment join, where the predicate is ⊆, and the all-nearest-neighbors operator, which retrieves for each set in R its k nearest neighbors in S based on a similarity function. Other operators include the set equality join and the set overlap join. As an example of a set containment join, consider the join of a job-offers relation R with a persons relation S such that R.required_skills ⊆ S.skills, where R.required_skills stores the required skills for each job and S.skills captures the skills of persons. This query will return qualified person-job pairs. As an example of a set overlap join, consider two bookstore databases with similar book catalogs, which store information about their clients. We may want to find pairs of clients which have bought a large number of common books in order to make recommendations or for classification purposes. In this paper we study the efficient processing of set join operators. We assume that the DBMS stores the set-valued attribute in a nested representation (i.e., the set elements are stored together with the other attributes in a single tuple).
This representation facilitates efficient query evaluation. We propose join algorithms that utilize inverted files and compare them with signature-based methods for several set-comparison predicates. We propose a new algorithm for the set containment join, which is significantly faster than previous approaches. The algorithm builds an inverted file for the container relation S and uses it to process the join. We demonstrate the efficiency and robustness of this method under various experimental settings. We also discuss and validate the efficient processing of other join predicates, namely set equality and set overlap joins. For each problem, we consider alternative methods based on both signatures and inverted files. Experiments show that using an inverted file to perform the join is also the most appropriate choice for overlap joins, whereas signature-based methods are competitive only for equality joins. The inverted file is an alternative index for set-valued attributes. For each set element el in the domain D, an inverted list is created with the record ids of the sets that contain this element, in sorted order. These lists are stored sequentially on disk, and a directory is created on top of the inverted file, pointing for each element to the offset of its inverted list in the file. If the domain cardinality |D| is large, the directory may not fit in memory, so its entries ⟨el, offset⟩ are organized in a B+-tree having el as key. Given a query set q, assume that we want to find all sets that contain it. Searching the inverted file is performed by fetching the inverted lists for all elements in q and merge-joining them. The final list contains the record ids of the results. For example, if the set containment query q = {el2, el3} is applied on the inverted file, only the tuple with rid=132 qualifies. Since the record ids are sorted in the lists, the join is performed fast.
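The containment search just described can be sketched as follows (a minimal sketch with hypothetical names; the data is chosen to reproduce the q = {el2, el3}, rid = 132 example above):

```python
from collections import defaultdict

def build_inverted_file(relation):
    """relation: rid -> set of elements. Each inverted list holds, in
    sorted order, the rids of the sets containing that element."""
    inv = defaultdict(list)
    for rid in sorted(relation):          # rids appended in sorted order
        for el in relation[rid]:
            inv[el].append(rid)
    return inv

def containment_query(inv, q):
    """Find all sets that contain q by intersecting the inverted lists
    of q's elements (a merge-join in the paper; plain sets here)."""
    lists = [set(inv.get(el, ())) for el in q]
    return sorted(set.intersection(*lists)) if lists else []

S = {131: {'el1', 'el3'}, 132: {'el1', 'el2', 'el3'}, 133: {'el2'}}
print(containment_query(build_inverted_file(S), {'el2', 'el3'}))  # [132]
```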
If the search predicate is `overlap', then the lists are just merged, eliminating duplicates. Notice that the inverted file can also process overlap queries with a threshold efficiently, since we just need to count the number of occurrences of each record id in the merged list. Fetching the inverted lists can incur many random accesses. In order to reduce the access cost, the elements of q are first sorted and the lists are retrieved in this order. Prefetching can be used to avoid random accesses if two lists are close to each other. Updates on the inverted file are expensive. When a new set is inserted, a number of lists equal to its cardinality have to be updated. An update-friendly version of the index allows some space at the end of each list for potential insertions and/or distributes the lists to buckets, so that new disk pages can be allocated at the end of each bucket. The inverted file shares some features with the bit-sliced signature file. Both structures merge a set of rid-lists (represented in a different way) and are mostly suitable for static data with batch updates. In recent comparison studies, the inverted file was proven significantly faster than signature-based indexes for set containment queries. There are several reasons for this. First, it is especially suitable for real-life applications where the sets are sparse and the domain cardinality large. In this case, signature-based methods generate many false drops. Second, it applies exact indexing, as opposed to signature-based methods, which retrieve a superset of the qualifying rids and require a refinement step. Finally, as discussed in the next subsection, it achieves higher compression rates than signature-based approaches, and its performance is less sensitive to the decompression overhead. Both signature-based indexes and inverted files can use compression to reduce the storage and retrieval costs.
Dense signatures are already compressed representations of sets, but sparse ones could require more storage than the sets themselves (e.g., the exact signature representations of small sets). Thus, one way to compress the signatures is to reduce their length at the cost of increasing the false-drop probability. Another method is to store the element ids instead of the signature for small sets, or to encode the gaps between the 1's instead of storing the actual bitmap. However, this technique increases the cost of query processing; the signature has to be constructed on-the-fly if we still want to use fast bitwise operations. The bit-sliced index can also be compressed by encoding the gaps between the 1's, but again this comes at the expense of extra query processing cost. The inverted file for a relation R can be constructed fast using sorting and hashing techniques. A simple method is to read each tuple tR ∈ R and decompose it into a number of binary ⟨el, tR.rid⟩ tuples, one for each element id el that appears in the set tR.r. These tuples are then sorted on el and the index is built. For small relations this is fast enough; however, for larger relations we have to consider techniques that save I/O and computational cost. A better method is to partition the binary tuples using a hash function on el and build a part of the index for each partition individually. The indexes are then concatenated, and the same hash function is used to define an order for the elements. In order to minimize the size of the partitions, before we write the partial lists of the partitions to disk, we can sort them on tR.rid and compress them. This comes at the additional cost of decompressing them and sorting the tuples on el to create the inverted lists for the partition. An alternative method is to sort the partial lists on el early, in order to save computations when sorting the whole partition in memory.
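The gap encoding mentioned above (storing the differences between consecutive sorted rids, or between 1 positions, rather than the values themselves) can be sketched as follows; function names are hypothetical, and real implementations would follow this with a variable-length byte code for the small gaps:

```python
def gap_encode(rids):
    """Replace a sorted rid list with its first value plus the gaps
    between consecutive rids; small gaps compress well."""
    return [rids[0]] + [b - a for a, b in zip(rids, rids[1:])]

def gap_decode(gaps):
    """Rebuild the rid list by accumulating the gaps."""
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

rids = [4, 7, 11, 15, 17]
print(gap_encode(rids))                       # [4, 3, 4, 4, 2]
print(gap_decode(gap_encode(rids)) == rids)   # True
```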
The cost of building the inverted file comprises the cost of reading the relation, the cost of writing and reading the partitions, and the cost of writing the final inverted file. If compression is used, the overall cost is not much larger than the cost of reading R. This is due to the fact that the compressed partial information is several times smaller than the actual data.

A simple indexed nested-loops join method

The simplest and most intuitive method to perform the set containment join using inverted files is to build an inverted file SIF for S and apply a set containment query for each tuple in R. It is not hard to see that if SIF fits in memory, this method has optimal I/O cost. However, for large problems SIF may not fit in memory, and in the worst case it has to be read once for each tuple in R. Therefore, we should consider alternative techniques that apply in the general case.

Adaptive Algorithms for Set Containment Joins

A set containment join is a join between set-valued attributes of two relations, whose join condition is specified using the subset (⊆) operator. Set containment joins are deployed in many database applications, even those that do not support set-valued attributes. In this paper, we propose two novel partitioning algorithms, called the Adaptive Pick-and-Sweep Join (APSJ) and the Adaptive Divide-and-Conquer Join (ADCJ), which allow computing set containment joins efficiently. We show that APSJ outperforms previously suggested algorithms for many data sets, often by an order of magnitude. We present a detailed analysis of the algorithms and study their performance on real and synthetic data using an implemented testbed. Set containment queries are utilized in many database applications, especially when the underlying database systems support set-valued attributes. For example, consider a database application used by a human-resource broker company to match the skills of job seekers with the skills required by employers.
Imagine that the set of skills needed for filling an open position is stored in the set-valued attribute {reqSkills} of table Jobs(jobID, {reqSkills}). Another table, Jobseekers(personID, {availSkills}), keeps a set of skills of potential recruits. Then, a match between the qualifying job seekers and the jobs can be computed using the set containment query SELECT Jobseekers.personID, Jobs.jobID FROM Jobseekers, Jobs WHERE Jobs.{reqSkills} ⊆ Jobseekers.{availSkills}. In this query, the tables are joined on their set-valued attributes using the subset operator ⊆ as the join condition. This kind of join is called a set containment join. Set containment joins are used in a variety of other scenarios. If, for instance, our first relation contained sets of parts used in construction projects, and the second one contained sets of parts offered by each equipment vendor, we could determine which construction projects can be supplied by a single vendor using a set containment join. Or, consider a database application that recommends to students a list of courses that they are eligible to take. Such a recommendation can be computed using a set containment join on prerequisite courses and courses already taken by students. Notice that containment queries can be utilized even in database systems that support only atomic attribute values. Consider a relational database of a stock broker company. Imagine that the investment portfolios of the customers are kept in a relational table Portfolios(portfolioID, stockID, numPositions), while the information about the composition of the mutual funds is kept in a table Funds(fundID, stockID, percentage). Assume that the stock broker wants to find the portfolios that mirror mutual funds, i.e., those that contain just a portion of the stocks from a certain mutual fund, and no other stocks.
The query

SELECT P.portfolioID, F.fundID
FROM Portfolios P, Funds F
WHERE P.stockID = F.stockID
GROUP BY P.portfolioID, F.fundID
HAVING COUNT(*) = (SELECT COUNT(*) FROM Portfolios P2
                   WHERE P2.portfolioID = P.portfolioID)

joins Portfolios and Funds on the stockID attribute and returns those portfolios whose stock positions are entirely contained in a fund, together with the fundID. To see how this query works, notice that the number of joining tuples for a given P.portfolioID and F.fundID must be equal to the total number of stocks in portfolio P.portfolioID. The above query corresponds to a set containment query, although expressed in a less obvious way. Additional types of applications for containment joins arise when text or XML documents are viewed as sets of words or XML elements, or when flat relations are folded into a nested representation. The two best-known algorithms for computing set containment joins efficiently are the Partitioning Set Join (PSJ) and the Divide-and-Conquer Join (DCJ). PSJ and DCJ introduce crucial performance gains compared with straightforward approaches. A major limitation of PSJ is that it quickly becomes ineffective as set cardinalities grow. In contrast, DCJ depends only on the ratio of set cardinalities in the two relations and therefore wins over PSJ when the sets are large. Often, the sets involved in the join computation are indeed quite large. For instance, biochemical databases contain sets with many thousands of elements each. In fact, the fruit fly (drosophila) has around 14000 genes, 70-80% of which are active at any time. A snapshot of active genes can thus be represented as a set of around 10000 elements. PSJ is ineffective for such data sets. A set containment join is a join between set-valued attributes of two relations, whose join condition is specified using the subset (⊆) operator. Consider two sample relations R and S shown in Table I.
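The portfolio-mirroring query shown earlier in this section runs as written on a plain relational engine; the following minimal SQLite sketch demonstrates it with hypothetical data (the numPositions and percentage columns are omitted, as the query does not use them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Portfolios (portfolioID TEXT, stockID TEXT);
CREATE TABLE Funds      (fundID TEXT, stockID TEXT);
-- p1 holds {s1, s2}, a subset of fund f1's stocks {s1, s2, s3};
-- p2 holds s9, which no fund contains.
INSERT INTO Portfolios VALUES ('p1','s1'),('p1','s2'),('p2','s1'),('p2','s9');
INSERT INTO Funds      VALUES ('f1','s1'),('f1','s2'),('f1','s3');
""")
rows = conn.execute("""
    SELECT P.portfolioID, F.fundID
    FROM Portfolios P, Funds F
    WHERE P.stockID = F.stockID
    GROUP BY P.portfolioID, F.fundID
    HAVING COUNT(*) = (SELECT COUNT(*) FROM Portfolios P2
                       WHERE P2.portfolioID = P.portfolioID)
""").fetchall()
print(rows)  # [('p1', 'f1')]
```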
Each of the relations contains one column with sets of integers: three sets in R and four in S. For easy reference, the sets of R and S are labeled using letters a, b, c and A, B, C, D, respectively. Computing the containment join R⊆S amounts to finding all tuples (r, s) ∈ R×S such that r ⊆ s. In our example, R⊆S = {(a, A), (b, B), (c, C)}. Obviously, we can always compute R⊆S in a straightforward way by testing each tuple in the cross-product R×S for the subset condition. In our example, such an approach would require |R|·|S| = 3·4 = 12 set comparisons. For large relations R and S, doing |R|·|S| comparisons becomes very time consuming. The set comparisons are expensive, since each one requires traversing and comparing a substantial portion of the elements of both sets. Moreover, when the relations do not fit into memory, enumerating the cross-product incurs a substantial I/O cost.

Bit-Sliced Index Arithmetic

The bit-sliced index (BSI) paper introduces the concept of BSI arithmetic. For any two BSIs X and Y on a table T, we show how to efficiently generate new BSIs Z, V, and W, such that Z = X + Y, V = X − Y, and W = MIN(X, Y); this means that if a row r in T has a value x represented in BSI X and a value y in BSI Y, the value for r in BSI Z will be x + y, the value in V will be x − y, and the value in W will be MIN(x, y). Since a bitmap representing a set of rows is the simplest bit-sliced index, BSI arithmetic is the most straightforward way to determine multisets of rows (with duplicates) resulting from the SQL clauses UNION ALL (addition), EXCEPT ALL (subtraction), and INTERSECT ALL (min). Another contribution of the paper is to generalize BSI range restrictions to a new non-Boolean form: determining the top k BSI-valued rows, for any meaningful value k between one and the total number of rows in T.
Together with bit-sliced addition, this permits us to solve a common basic problem of text retrieval: given an object-relational table T of rows representing documents, with a collection-type column K representing keyword terms, we demonstrate an efficient algorithm to find k documents that share the largest number of terms with some query list Q of terms. A great deal of published work on such problems exists in the Information Retrieval (IR) field. The algorithm we introduce, which we call Bit-Sliced Term Matching, or BSTM, uses an approach comparable in performance to the most efficient known IR algorithm, a major improvement on current DBMS text searching algorithms, with the advantage that it uses only the indexing we propose for native database operations.

Bitmap Index Definition

To create a bitmap index, all N rows of the underlying table T must be assigned ordinal numbers 1, 2, ..., N, called ordinal row-positions, or simply ordinal positions. Then, for any index value Xi of an index X on T, a list of rows in T that have the value Xi can be represented by an ordinal-position list such as 4, 7, 11, 15, 17, ..., or equivalently by a verbatim bitmap 00010010001000101.... Note that sparse verbatim bitmaps (having a small number of 1's relative to 0's) will be compressed, to save disk and memory space. Variations on this bitmap index definition were studied in [CHI98, CHI99, WUB98, WU99]. Ordinal row-positions 1, ..., N can be assigned to table pages in fixed-size blocks of size J: 1 through J on the first page, J+1 through 2J on the second page, etc., where J is the maximum number of rows of T that will fit on a page (i.e., the maximum occurs for the shortest rows). This makes it possible to determine the zero-based page number pn for a row with ordinal position n by the formula pn = ⌊(n−1)/J⌋. A known page number can then be accessed very quickly when long extents of the table are mapped to contiguous physical disk addresses.
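The page-assignment rule above is a one-line computation; a minimal sketch (the function name is hypothetical):

```python
def page_number(n, J):
    """Zero-based page for the row with ordinal position n, when
    ordinal positions are assigned in fixed blocks of J rows per page:
    pn = floor((n - 1) / J)."""
    return (n - 1) // J

# With J = 100 rows per page: positions 1..100 -> page 0, 101..200 -> page 1.
print(page_number(100, 100), page_number(101, 100))  # 0 1
```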
Since variable-length rows might lead to fewer rows on a page, some pages might have no rows for the larger ordinal numbers assigned; for this reason, an Existence Bitmap (EBM) is maintained for the table, containing 1 bits in ordinal positions where rows exist and 0 bits otherwise. The EBM can also be useful if rows are deleted, making it possible to defer index updates. It is a common misunderstanding that every row-list in a bitmap index must be carried in verbatim bitmap form. In reality, some form of compression is always used for sparse bitmaps (although verbatim bitmaps are preferred down to a relatively sparse ratio of 1's to 0's such as 1/50, because many operations on verbatim bitmaps are more CPU-efficient than on compressed forms). In the architecture implemented for this paper, which we call the BITSLICE architecture, bitmap compression simply involves converting sparse bitmap pages into ordered lists of segment-relative ordinal positions called ORDRIDs (defined below). We first describe segmentation, which was introduced in [ON87, ONQ97] and is used in the BITSLICE architecture.

Operations on Bitmaps

Pseudo-code for the logical operations AND, OR, NOT, and COUNT on bitmaps has been provided elsewhere, so we limit ourselves here to short descriptions. Given two verbatim bitmaps B1 and B2, we can create the bitmap B3 = B1 AND B2 by treating memory-resident segment fragments of these bitmaps as arrays of long ints in C, and looping through the fragments, setting B3[I] = B1[I] & B2[I]. The logic can stream through successive segment fragments from disk (for B1 and B2) and to disk (B3) until the operation is complete. The bitmap B3 = B1 OR B2 is computed in the same way, and B3 = NOT B1 is computed by setting B3[I] = ~B1[I] & EBM[I] in the loop.
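The word-array loops for AND, OR, NOT (masked by the EBM), and the lookup-table COUNT can be sketched as follows; a minimal sketch on one-word segments, with hypothetical names:

```python
def b_and(b1, b2):
    """B3[I] = B1[I] & B2[I] over the word arrays of a segment."""
    return [x & y for x, y in zip(b1, b2)]

def b_or(b1, b2):
    return [x | y for x, y in zip(b1, b2)]

def b_not(b1, ebm):
    """NOT is masked with the Existence Bitmap: only ordinal positions
    where rows exist may become 1 bits."""
    return [~x & e for x, e in zip(b1, ebm)]

# Lookup table: number of 1 bits in every 16-bit "short int".
POP16 = [bin(i).count("1") for i in range(1 << 16)]

def bitmap_count(words):
    """COUNT(B): split each word into 16-bit chunks and sum their
    precomputed bit counts, as in the short-int overlay trick."""
    total = 0
    for w in words:
        while w:
            total += POP16[w & 0xFFFF]
            w >>= 16
    return total

ebm = [0b1111]                  # a tiny segment with 4 existing rows
b1, b2 = [0b0101], [0b0011]
print(b_and(b1, b2))            # [1]  -> 0b0001
print(b_or(b1, b2))             # [7]  -> 0b0111
print(b_not(b1, ebm))           # [10] -> 0b1010
print(bitmap_count(b_or(b1, b2)))  # 3
```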
Note that the efficiency of bitmap operations arises from a type of parallelism in Boolean operations in CPU registers, specifically SIMD (Single-Instruction, Multiple-Data), where many bits (32, or 64 on some machines) are dealt with in a single AND, OR, or NOT operation occurring in the simplest possible loop. To find the number of rows represented in a bitmap B1, COUNT(B1), another SIMD trick is used: the bitmap fragment to be counted is overlaid with a short-int array, and the loop through the fragment then uses the short ints as indexes into another array containing the number of 1 bits in each short int, aggregating these into a count variable. We perform the logical operations AND and OR on two segment ORDRID-lists B1 and B2 by looping through the two lists to perform a merge-intersect or merge-union into an ORDRID-list B3; in the case of OR, the resulting ORDRID-list might grow large enough to require conversion to a verbatim bitmap, an easy case to recognize, and easily done by initializing a zero bitmap for this segment and turning on the bits found in the union. The NOT operation on a segment ORDRID-list B1 is performed by copying the EBM segment and turning off the bits corresponding to ORDRIDs found in B1. To perform AND and OR with a verbatim bitmap B1 in one index segment and an ORDRID-list B2 in another, the ORDRID-list is assumed to have fewer elements and drives the loop to access individual bits in the bitmap and perform the Boolean test: in the case of AND, filling in a new ORDRID-list B3, and in the case of OR, initializing the verbatim bitmap B3 to B1 and turning on bits from the ORDRID-list B2.

Existing System

State-of-the-art bitmap compression methods and encoding strategies have made it affordable to build bitmap indices on many attributes. This index structure is also applicable to many different types of attributes.
The bitmap index-based approach processes general queries (with joins, selections, multi-attribute grouping, and multiple set predicates) by utilizing single-attribute indices. Scalar-level predicates in SQL become increasingly inadequate to support a class of operations that require set-level comparison semantics, i.e., comparing a group of tuples with multiple values. Currently, complex SQL queries composed of scalar-level operations are often formed to obtain even very simple set-level semantics. Such queries are not only difficult to write but also challenging for a database engine to optimize, and thus can result in costly evaluation. Predicates are used in the search conditions of WHERE clauses and HAVING clauses, the join conditions of FROM clauses, and other constructs where a Boolean value is required. One way to create a table, either on disk or in main memory, is to use a hash table. The goal of a hash table is Θ(1) lookup time, that is, constant lookup time. A mask is data that is used for bitwise operations, particularly in a bit field. Such sets are often dynamically formed.

Disadvantages:
Such complex queries are difficult for users to formulate.
A more severe consequence is that set-level semantics becomes obscure; hence, a DBMS may choose unnecessarily costly evaluation plans for such queries.

Proposed System
This paper proposes to augment SQL with set predicates, to bring out otherwise obscured set-level semantics. We define two approaches to processing set predicates:
1) an aggregate function-based approach
2) a bitmap index-based approach
Moreover, we design a histogram-based probabilistic method of set predicate selectivity estimation, for optimizing queries with multiple set predicates. Experiments verified its accuracy and effectiveness in optimizing queries. This paper focuses on the relational data model and architecture. We propose to extend SQL with set predicates to support set-level comparisons.
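The core of the bitmap index-based idea above can be sketched as follows. Assume (our assumption, for illustration only) that each required value's row bitmap has already been folded into a per-group bitmap, where bit g is 1 iff group g contains a tuple with that value; the groups whose sets contain all k required values are then simply the AND of the k per-group bitmaps. Function and variable names are illustrative.

```c
#include <stddef.h>

/* value_bitmaps holds k per-group bitmap fragments stored row-major,
   each nwords unsigned longs long. result[w] ends up with bit g set
   iff group g contains every one of the k required values. */
void groups_containing_all(const unsigned long *value_bitmaps,
                           size_t k, size_t nwords,
                           unsigned long *result) {
    for (size_t w = 0; w < nwords; w++) {
        unsigned long acc = ~0UL;            /* start with all groups */
        for (size_t v = 0; v < k; v++)
            acc &= value_bitmaps[v * nwords + w];
        result[w] = acc;
    }
}
```

For example, if value v1 appears in groups {0, 1, 2} and value v2 in groups {1, 3}, only group 1 satisfies a containment predicate over {v1, v2}, and the AND of the two per-group bitmaps produces exactly that bit.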
Such predicates, combined with grouping, allow selection of dynamically formed groups by comparing a group with a set of values. The GROUP BY clause must follow the conditions in the WHERE clause and must precede the ORDER BY clause if one is used. The HAVING clause must follow the GROUP BY clause in a query and must also precede the ORDER BY clause if used. Predicates are used in the search conditions of WHERE clauses and HAVING clauses, the join conditions of FROM clauses, and other constructs where a Boolean value is required.

Advantages:
It extends SQL with set predicates to support set-level comparisons.
It compares a group of tuples with multiple values.

Hardware Requirements
Processor Type : Pentium IV
Speed : 2.4 GHz
RAM : 256 MB
Hard disk : 20 GB
Keyboard : 101/102 Standard Keys
Mouse : Scroll Mouse

Software Requirements
Operating System : Windows 7
Programming Package : NetBeans IDE 7.3.1
Coding Language : JDK 1.7

System Architecture
[Architecture diagram: the End User and the Admin each register and log in; the Admin uploads the dataset to the database; the End User submits search queries and gets results.]

Modules
1. Register & Login Process
2. Upload Dataset
3. Search Query with Set Predicates

Register & Login Process
This module serves both the admin and the user. The admin creates the dataset and uploads it to the database; the user then accesses the database through a keyword search technique. Before logging in, both the admin and the user must first register. The registration form asks for username, age, user id, password, sex, and so on. A login refers to the credentials required to obtain access to a database or other restricted area. Logging in is the process by which an individual user's access to a computer system is controlled by identifying and authenticating the user through the credentials the user presents. Both the admin and the user can access the database using their user id and password.

Upload Dataset
In this module, the admin can create or choose the dataset and upload it to the database.
These datasets are used for relational keyword search by the user. First, the admin creates or chooses the dataset; the dataset is then extracted and uploaded to the database. The user then submits a search query and gets relevant results.

Search Query with Set Predicates
The end user submits an SQL query with set predicates, bringing out otherwise obscured set-level semantics. We use two approaches to processing set predicates:
1) an aggregate function-based approach
2) a bitmap index-based approach

Aggregate function-based approach: This approach processes set predicates in a way similar to processing conventional aggregate functions. Given a query with set predicates, instead of decomposing the query into multiple subqueries, this approach needs only one pass of table scan.

Bitmap index-based approach: This approach processes set predicates by using bitmap indices on individual attributes. A bitmap index is a special kind of index that stores the bulk of its data as bit arrays (bitmaps) and answers most queries by performing bitwise logical operations on these bitmaps. Finally, we get the relevant result.

Data Flow Diagram
[Data flow diagram: the user registers and logs in, uploads the dataset to the database, submits a search query, and gets relevant results.]

References
S. Helmer and G. Moerkotte, "Evaluation of Main Memory Join Algorithms for Joins with Set Comparison Join Predicates," Proc. Int'l Conf. Very Large Data Bases (VLDB), 1996.
N. Mamoulis, "Efficient Processing of Joins on Set-Valued Attributes," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 157-168, 2003.
S. Melnik and H. Garcia-Molina, "Adaptive Algorithms for Set Containment Joins," ACM Trans. Database Systems, vol. 28, no. 1, pp. 56-99, 2003.
D. Rinfret, P. O'Neil, and E. O'Neil, "Bit-Sliced Index Arithmetic," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 47-57, 2001.