Performance Evaluation of Relational
Implementations of Inverted Text Index
Qi Su
Stanford University
Yu-Shan Fung
Stanford University
ABSTRACT
Information retrieval (IR) systems are adept at
processing keyword queries over unstructured text. In
contrast, relational database management systems
(RDBMS) are designed for queries over structured
data. Recent work has demonstrated the benefits of
implementing the traditional IR inverted index in an
RDBMS, such as portability, parallelism, and
scalability. We perform an in-depth comparison of
alternative relational implementations of the inverted
text index against a traditional IR system.
1. INTRODUCTION
Database and information retrieval (IR) are two rich
fields of research that have produced ubiquitous tools
such as the relational database management system
(RDBMS) and the web search engine. Historically,
however, these two fields have largely developed
independently, even though they share one overriding
objective: the management of data.
Traditional IR systems do not take good advantage of
the structure of data, or of metadata. Conversely,
relational database systems tend to have
limited support for handling unstructured text. Major
database vendors do offer sophisticated IR tools that
are closely integrated with their database engines, for
example, Oracle Text, IBM DB2 Text Information
Extender, and Microsoft SQL Server Full-Text
Search. These tools offer a full range of options, from
Boolean, to ranked, to fuzzy search. However, each
text index is defined over a single relational column.
Hence, significant storage overhead is incurred, first
by storing the plain text in a relational column, and
again by the inverted index built by the text search
tool. These tools offer powerful extensions to the
traditional relational database, but do not address the
full range of IR requirements. Their vendor-specific
nature also means they are not portable solutions.
There has been research in the past decade
investigating the use of relational databases to build
inverted index-based information retrieval systems.
There are several key advantages to such an approach.
A pure relational implementation using standard SQL
offers portability across multiple hardware platforms,
OS, and database vendors. Such a system does not
require software modification in order to scale on a
parallel machine, as the DBMS takes care of data
partitioning and parallel query processing. Use of a
relational system enables searching over structured
metadata in conjunction with traditional IR queries.
The DBMS also provides features such as
transactions, concurrent queries, and failure recovery.
Most previous work has picked one relational
implementation and compared it with a special-purpose
IR system. Some of it has focused on a
particular advantage, such as scalability on a parallel
cluster.
We propose a comprehensive evaluation of the
alternative relational implementations of inverted text
index that have been discussed in literature, with the
special-purpose IR system Lucene being the baseline
for comparison. We will evaluate the systems on
Boolean queries, phrase queries, and relevance ranked
queries, and benchmark their relative performance in
terms of query response times.
In section 2, we discuss the related work in literature
concerning implementing inverted index as relations
and integrating IR and DBMS. In section 3, we
present the baseline IR and alternative relational
implementations of inverted index systems. In section
4, we describe how these systems are evaluated, our test
dataset, and the queries to be executed. Section 5 presents
the relative performance results collected and our
observations. Finally, in section 6, we present
concluding remarks and future work.
2. RELATED WORK
Several works have picked a single relational
implementation and compared its performance with a
baseline special purpose IR system. Kaufmann et al
[KS95] compared an IR system, BASISPlus, an early
version of Oracle’s text search tool, SQL*TR, and a
relational implementation of the inverted list with two
relations, <term, docid> and <term, docfreq>. The
evaluation dataset is small, with 850,000 tuples in the
<term, docid> inverted list. The queries are strictly
conjunctive Boolean queries.
More recent works have shown that Boolean,
proximity, and vector space ranked model searching
can be effectively implemented as standard relations
while offering satisfactory performance when
compared to a baseline traditional IR system.
Grossman et al [GFH97] demonstrate that relational
implementations are effective for Boolean, proximity,
and ranked queries. The relational model,
implemented using Microsoft SQL Server, consists of a
doc_term table <docid, term, term freq>, a
doc_term_prox table <docid, term, position>, and an idf
table <term, idf>. The baseline IR system is Lotus
Notes, which is a heavyweight system that is not built
specifically for IR tasks. The authors also studied
parallel performance on an AT&T 4-processor
database machine.
Some works have focused on a single advantage of
relational implementations over traditional IR
inverted index. Grabs et al [GBS01] evaluated the
performance and scalability of a database IR system
on a parallel cluster. The system is implemented with
BEA middleware over Oracle database, with
significant emphasis on the transaction semantics to
ensure high levels of search and insert parallelism.
The basic data model is <term, docid>. Only the
performance of Boolean queries is measured.
Brown et al [BCC94, BR95] demonstrated efficient
inverted index implementation and fast incremental
index update using a database system. However, their
implementation used a persistent object store
manager, which is beyond our scope of using the
traditional relational model and off-the-shelf
RDBMS.
A recent issue of IEEE Data Engineering Bulletin
covered the work by major database vendors to
integrate full text search functionality into the
RDBMS. [MS01], [HN01], [DIX01] presented how
IBM DB2, Microsoft SQL Server, and Oracle
introduce text extensions that are tightly coupled with
the database engine. However, as we discussed
earlier, such an approach is limited in that each text
index must be defined over a single column, and
storing both the full text of the document in the
database, as well as storing the inverted index on the
side, incurs significant storage overhead.
3. SYSTEM IMPLEMENTATIONS
We evaluate four systems on information retrieval
tasks. The baseline system is Lucene, a special-purpose IR search engine. The three relational designs
are implemented using IBM DB2 Universal Database.
The first relational approach uses the DB2 Text
Information Extender to take care of all the indexing
and query processing. The two remaining relational
approaches implement the inverted index as relations,
and transform keyword queries to standard SQL
queries.
3.1. Lucene
Lucene is an open-source text search engine system
under the Apache project. It is written entirely in Java.
We chose this as our baseline system as it offers ease
of deployment and a full feature set representative of
a traditional IR system.
Lucene includes three key APIs, IndexWriter,
IndexReader, and IndexSearcher. IndexWriter enables
the user to construct an inverted text index over a
corpus of documents. Indexing may be customized
with parameters such as case folding, stemming, etc.
The IndexReader allows the user to probe the contents
of the inverted index, for example, enumerating all
tokens in the index. This important aspect will be
discussed in section 4.1. The IndexSearcher provides
a rich set of search options, including Boolean,
phrase, ranked, and fuzzy search queries.
With our corpus, we pass one document at a time
to our IndexWriter instance to be tokenized and
indexed. The keys associated with each document are the
document ID and URL, which is just the document
file name. At execution time, we use the appropriate
method of the IndexSearcher instance to retrieve the
relevant document URLs from the Lucene index.
By default, all Lucene search hits are returned with a
ranked score. We only care about the
ranking when we measure ranked query
performance. Lucene keyword queries are structured
as a single search string. And queries prefix each
keyword with a plus sign. Or queries are a
space-delimited list of keywords. Phrase queries are a
space-delimited list of keywords enclosed in quotes.
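The three query-string conventions just described can be sketched as a small helper. This is only an illustration of the string formats named in the text; the function name `lucene_query` and its `mode` argument are hypothetical and are not part of Lucene or of the authors' code.

```python
def lucene_query(keywords, mode):
    """Build a single Lucene search string for one of the query classes."""
    if mode == "and":
        # Every keyword is required: prefix each one with a plus sign.
        return " ".join("+" + k for k in keywords)
    if mode == "or":
        # Any keyword may match: plain space-delimited list.
        return " ".join(keywords)
    if mode == "phrase":
        # Keywords must appear adjacent and in order: quote the whole list.
        return '"' + " ".join(keywords) + '"'
    raise ValueError("unknown mode: " + mode)
```

The resulting string would then be handed to whatever search method the application uses; that step is omitted here because it depends on the Lucene version.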
3.2. IBM DB2 Text Information Extender
IBM DB2 Text Information Extender (TIE) is a full-text
search engine tightly coupled with the IBM DB2
Universal Database. It supports the creation of full-text
indexes on textual DB2 table columns. TIE uses
the table primary key to relate inverted index token
entries to their original source tuple in the table. TIE
is invoked through special function calls over columns,
much like a user-defined function.
We create a three column relation named fulltext
<docid, url, text>. The text column is of the type
Binary Large Object (BLOB) and contains the full
text of the document. Our simple parser creates a
single large load file. DB2’s load utility batch loads
the data into the relation. Then we invoke TIE to
index the text column.
Boolean and phrase queries use the contains function.
Ranked queries also use the score function.
And query:
SELECT url
FROM fulltext
WHERE contains (text, '"keyword1" & "keyword2" & "keyword3"')=1
Or query:
SELECT url
FROM fulltext
WHERE contains (text, '"keyword1" | "keyword2" | "keyword3"')=1
Phrase query:
SELECT url
FROM fulltext
WHERE contains (text, '"keyword1 keyword2 keyword3"')=1
Ranked query:
WITH temptable (url, score) AS
(SELECT url, score(text, '"keyword1" & "keyword2" & "keyword3"')
FROM fulltext)
SELECT url
FROM temptable
WHERE score>0
ORDER BY score DESC
3.3. Term - Document
In this representation, each tuple corresponds to a
term/document pair. The relations are:
tf <term, docid, termfreq, positionlist>
idf <term, idf>
url <url, docid>
The bulk load files are created by invoking the
Lucene IndexReader to probe the Lucene text index
over our corpus. We will discuss this aspect further in
section 4.1.
The positionlist attribute of the tf relation is an
offset-encoded list of all occurrence positions of the given
term in the given document: the first position is stored
as is, and each later position is stored as its distance
from the previous one. For example, the term
"hello" appearing in document 2 at positions 10, 100,
102 would be encoded as the tuple
<"hello", 2, 3, "10,90,2">
where 3 is the term frequency.
This representation is more compact than the
Term-Document-Position approach and is well suited
for Boolean and ranked queries. In the case of phrase
or positional queries, we implement application logic
to merge position lists.
There are two alternatives to implement an And
query in SQL. The natural choice is to translate an
N-word query into an N-way equi-join on the document
ID, where each join relation has been filtered to select
documents containing one of the words.
SELECT u.url
FROM url u, tf t1, tf t2
WHERE u.docid=t1.docid AND t1.docid=t2.docid AND
t1.term='keyword1' AND t2.term='keyword2'
Figure 1. Query Plan for equi-join And query (operators: two HashJoin, two Indexscan on term, one Tablescan; inputs: URL, TF, TF)
An alternative due to Grossman [GFH97] treats the
query keywords as an artificial relation, and joins it
with the term-document relation. The result is grouped
by document, and only documents having the correct
number of keyword matches are preserved.
WITH query(term)
AS (values ('keyword1'), ('keyword2'))
SELECT u.url FROM url u, tf d, query q
WHERE u.docid=d.docid AND d.term=q.term
GROUP BY u.url
HAVING count(d.term)=2
However, upon further evaluation (results not shown),
we discovered that the second implementation never
outperforms the first, and in some cases performs over
an order of magnitude worse. Hence, from now on, all
references to And queries in both the Term-Doc
and Term-Doc-Position approaches mean the first
implementation.
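As a sanity check that the two And formulations select the same documents, the following sketch runs both against a toy database. It uses SQLite through Python's standard sqlite3 module rather than DB2, so it says nothing about the relative performance reported above. The table contents are invented, and the keyword relation is named `query_terms` here (an assumption made to avoid any keyword clash in SQLite) instead of `query`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url (url TEXT, docid INTEGER);
    CREATE TABLE tf  (term TEXT, docid INTEGER, termfreq INTEGER);
    INSERT INTO url VALUES ('a.xml', 1), ('b.xml', 2), ('c.xml', 3);
    INSERT INTO tf VALUES
        ('keyword1', 1, 2), ('keyword2', 1, 1),
        ('keyword1', 2, 1),
        ('keyword2', 3, 4);
""")

# First formulation: N-way equi-join on docid, one tf alias per keyword.
equi_join = """
    SELECT u.url FROM url u, tf t1, tf t2
    WHERE u.docid = t1.docid AND t1.docid = t2.docid
      AND t1.term = 'keyword1' AND t2.term = 'keyword2'
"""

# Second formulation: join with a keyword relation, then keep only
# documents that matched every keyword.
group_by = """
    WITH query_terms(term) AS (VALUES ('keyword1'), ('keyword2'))
    SELECT u.url FROM url u, tf d, query_terms q
    WHERE u.docid = d.docid AND d.term = q.term
    GROUP BY u.url
    HAVING count(d.term) = 2
"""

equi_hits = sorted(row[0] for row in conn.execute(equi_join))
group_hits = sorted(row[0] for row in conn.execute(group_by))
```

On this toy data only document 1 contains both keywords, so both lists contain just its URL.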
Figure 2. Query Plan for alternative And query (operators: Sort, Groupby, Filter, Hashjoin, Tablescan, Indexscan on term; inputs: URL, Query, TF)
The Or query is a simple selection whose filter is a
disjunction over the keywords.
SELECT DISTINCT(u.url)
FROM url u, tf t
WHERE u.docid=t.docid AND
( t.term='keyword1' OR t.term='keyword2' )
For phrase queries, we retrieve all candidate
documents, and the position list for all the keywords
in the query. Candidate documents are the results of
the AND query with the same keywords. The result
relation looks like <docid, position list for first
keyword, position list for second keyword, … >.
SELECT u.url, t1.positionlist, t2.positionlist
FROM url u, tf t1, tf t2
WHERE u.docid=t1.docid AND t1.docid=t2.docid AND
t1.term='keyword1' AND t2.term='keyword2'
The application logic then traverses the multiple
position lists together to find an instance where the
positions in each list are one apart, in order.
Our ranked query implementation is also due to
Grossman [GFH97]. First, we precompute term IDF
values as log ( number of documents / document
frequency of the term ). The relevance of a given
document d is the summation over all terms t
occurring in both the query and the document:
score(d) = ∑ over t of (query.termfreq for t * t's IDF) *
(d.termfreq for t * t's IDF).
WITH query(term, tf)
AS (values ('keyword1',1),('keyword2',1))
SELECT u.url, SUM (q.tf * i.idf * d.termfreq * i.idf) as score
FROM url u, query q, tf d, idf i
WHERE u.docid=d.docid AND q.term = i.term AND d.term = i.term
GROUP BY u.url
ORDER BY score DESC
3.4. Term - Document - Position
The term-document-position approach stores one
tuple for every occurrence of a term in a
document. Hence a term t that appears five times in
document d corresponds to five distinct tuples. The
relations are:
posting <term, docid, position>
idf <term, idf>
url <url, docid>
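To make the contrast between the two schemas concrete, the sketch below derives both the tf tuples (one per term/document pair, with an offset-encoded position list) and the posting tuples (one per occurrence) from a toy corpus. The function names, the zero-based positions, and the toy documents are assumptions for illustration; the paper populates its relations by probing the Lucene index rather than tokenizing text itself.

```python
def offset_encode(positions):
    """Encode ascending positions as the first position followed by gaps."""
    gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    return ",".join(str(g) for g in gaps)


def build_relations(docs):
    """docs maps docid -> list of already case-folded tokens.

    Returns (tf_rows, posting_rows) where
      tf_rows      holds (term, docid, termfreq, positionlist) tuples and
      posting_rows holds (term, docid, position) tuples.
    """
    tf_rows, posting_rows = [], []
    for docid, tokens in sorted(docs.items()):
        positions = {}
        for pos, term in enumerate(tokens):  # positions assumed zero-based
            positions.setdefault(term, []).append(pos)
            posting_rows.append((term, docid, pos))
        for term, plist in sorted(positions.items()):
            tf_rows.append((term, docid, len(plist), offset_encode(plist)))
    return tf_rows, posting_rows


docs = {1: ["hello", "world", "hello"], 2: ["world"]}
tf_rows, posting_rows = build_relations(docs)
```

The posting relation always has one row per token, while the tf relation has one row per distinct term in each document, which mirrors the cardinality gap between the two relations reported in section 4.1.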
For Boolean and ranked queries, this representation is
redundant, compared to the term-document approach.
At query time, we must insert the distinct operator in
our query plan to eliminate duplicates in our join
results. However, this representation leads to
straightforward SQL translation of phrase and
proximity queries. There is no need for application
logic or custom user-defined functions to post-process
position lists for positional matches. Positional
matches are specified as SQL arithmetic predicates
relating the position attributes.
The Boolean queries are very similar to the
Term-Document approach, except for the addition of
distinct operators.
Equi-join And:
SELECT DISTINCT(u.url)
FROM url u, posting t1, posting t2
WHERE u.docid=t1.docid AND t1.docid=t2.docid AND
t1.term='keyword1' AND t2.term='keyword2'
Or:
SELECT DISTINCT(u.url)
FROM url u, posting d
WHERE u.docid=d.docid AND
( d.term='keyword1' OR d.term='keyword2' )
Phrase queries:
SELECT distinct(u.url)
FROM url u, posting t1, posting t2
WHERE t1.docid=u.docid AND t1.docid=t2.docid AND
t1.term='keyword1' AND t2.term='keyword2' AND
t2.position=t1.position+1
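The positional predicate and the role of the distinct operator can be seen on a toy posting relation. This sketch uses SQLite via Python's sqlite3 module as a stand-in for DB2; the documents, the zero-based positions, and the helper name `phrase_urls` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url (url TEXT, docid INTEGER);
    CREATE TABLE posting (term TEXT, docid INTEGER, position INTEGER);
    INSERT INTO url VALUES ('a.xml', 1), ('b.xml', 2);
    -- document 1 reads "west life west life", document 2 reads "life west"
    INSERT INTO posting VALUES
        ('west', 1, 0), ('life', 1, 1), ('west', 1, 2), ('life', 1, 3),
        ('life', 2, 0), ('west', 2, 1);
""")

PHRASE_SQL = """
    SELECT {distinct} u.url
    FROM url u, posting t1, posting t2
    WHERE t1.docid = u.docid AND t1.docid = t2.docid
      AND t1.term = ? AND t2.term = ?
      AND t2.position = t1.position + 1
"""


def phrase_urls(first, second, distinct=True):
    """Return URLs of documents where `second` directly follows `first`."""
    sql = PHRASE_SQL.format(distinct="DISTINCT" if distinct else "")
    return [row[0] for row in conn.execute(sql, (first, second))]


with_distinct = phrase_urls("west", "life")
without_distinct = phrase_urls("west", "life", distinct=False)
```

Document 1 contains the phrase twice, so without the distinct operator its URL is returned once per matching pair of positions; document 2 contains both words but in the wrong order and is not returned.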
Ranked queries:
WITH query(term, tf)
AS (values ('keyword1',1),('keyword2',1))
SELECT u.url, SUM (q.tf * i.idf * i.idf) as score
FROM url u, query q, posting d, idf i
WHERE u.docid=d.docid AND q.term = i.term AND d.term = i.term
GROUP BY u.url
ORDER BY score DESC
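The ranked query over the posting relation has no explicit term-frequency factor because each occurrence contributes one tuple to the sum. The sketch below checks, on invented data, that summing once per occurrence gives the same score as multiplying by a stored term frequency. All names and values here are assumptions for illustration, not the authors' code.

```python
import math

NUM_DOCS = 4
# Document frequency of each term in the toy corpus (assumed values).
DOC_FREQ = {"keyword1": 2, "keyword2": 1}
IDF = {term: math.log(NUM_DOCS / df) for term, df in DOC_FREQ.items()}

# Query term frequencies, as in the query(term, tf) relation.
QUERY_TF = {"keyword1": 1, "keyword2": 1}

# Term-document rows: (term, docid, termfreq).
TF_ROWS = [("keyword1", 1, 2), ("keyword2", 1, 1), ("keyword1", 2, 1)]
# Term-document-position rows for the same corpus: (term, docid, position).
POSTING_ROWS = [("keyword1", 1, 0), ("keyword1", 1, 5),
                ("keyword2", 1, 3), ("keyword1", 2, 7)]


def score_term_doc(tf_rows):
    """Sum q.tf * idf * termfreq * idf per document."""
    scores = {}
    for term, docid, termfreq in tf_rows:
        if term in QUERY_TF:
            w = IDF[term]
            scores[docid] = scores.get(docid, 0.0) + QUERY_TF[term] * w * termfreq * w
    return scores


def score_term_doc_pos(posting_rows):
    """Sum q.tf * idf * idf once per occurrence tuple."""
    scores = {}
    for term, docid, _position in posting_rows:
        if term in QUERY_TF:
            w = IDF[term]
            scores[docid] = scores.get(docid, 0.0) + QUERY_TF[term] * w * w
    return scores


td_scores = score_term_doc(TF_ROWS)
tdp_scores = score_term_doc_pos(POSTING_ROWS)
```

Both dictionaries rank document 1 above document 2, with matching scores up to floating-point rounding.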
4. SYSTEM EVALUATION
The systems are implemented on a Pentium III
800MHz workstation with 1GB of RAM running
Windows 2000. The baseline IR system is Lucene
version 1.2 running on JDK 1.3. The relational
database is IBM DB2 UDB Enterprise Edition version
7.2. Our first relational approach uses the IBM DB2
Text Information Extender version 7.2.
4.1. Dataset
Our dataset consists of 199,932 Reuters newswire
articles from the year 1997. The raw text is 322MB.
The corpus has 895,308 distinct tokens after case
folding. There are 28,507,457 distinct term-document
pairs, which is the cardinality of the term-document
relation tf. There are 51,108,145 tokens in the corpus,
which is the cardinality of the term-document-position
relation posting.
The Reuters documents are wrapped in XML. We strip
the XML tags to produce the textual body, which is
loaded into the Lucene index and the DB2 TIE
relation. DB2 TIE's tokenization process is a black
box to us, the end user. However, by default, queries
do use case folding. Fancier features such as
stemming and thesaurus can be specified in the search
function invocation. Using Lucene, we can specify
options such as case folding and stemming at index
creation and search time. For a uniform comparison of
our four alternatives, we want uniform tokenization.
Since DB2 TIE has case folding enabled by default,
we use it as our common denominator. We build the
Lucene index with case folding turned on. To produce
uniform tokenization in our two relational
implementations, the term-document and
term-document-position relations are populated from
an inverted index probe of the Lucene index. We use the
Lucene IndexReader class to enumerate all
term-document-position information and produce the
appropriate bulk load file for our two relational
representations. Such an approach guarantees that we
have uniform tokenization, using only the case-folding
feature, across our four implementations.
Table 1 Space utilization (as a percentage of raw text size)
               Table   Index   Total
Raw text         -       -      100%
Lucene           -      133%    133%
DB2 TIE         n/a      99%    n/a
Term-Doc        337%    389%    726%
Term-Doc-Pos    429%    783%   1212%
DB2’s table size estimation utility could not estimate
the size of the fulltext base relation because the text
body is stored in a BLOB attribute, rather than inline
as varchar.
4.2. Queries
We test Boolean (And/Or), phrase and ranked queries.
Our queries are 1, 2 or 4 keywords long. We divide
each query class into subclasses of 3 different
selectivities of approximately 1 document hit
(0.0005% of corpus), 10 hits (0.005% of corpus), and
100 hits (0.05% of corpus). For each subclass, we
generate three distinct queries and measure the
average query execution time.
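The selectivity percentages quoted above follow directly from the corpus size of 199,932 documents given in section 4.1. A minimal worked check, with a hypothetical helper name:

```python
CORPUS_SIZE = 199_932  # number of Reuters articles, from section 4.1


def selectivity_percent(hits, corpus_size=CORPUS_SIZE):
    """Return the share of the corpus matched by a query, in percent."""
    return 100.0 * hits / corpus_size


selectivities = {hits: selectivity_percent(hits) for hits in (1, 10, 100)}
```

Rounded to one significant digit, the three values are 0.0005%, 0.005%, and 0.05% of the corpus.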
Table 2 Sample of Queries Executed and Selectivities
(expected number of hits)
hits  AND                  OR                        Phrase                   Rank
1     Gwil Industries (1)  Photronics Yomazzo (6)    Exceeding Consensus (2)  Photronics Yomazzo (6)
      Gaulle Quebec (1)    Videoserver Tandberg (4)  West life (1)            Videoserver Tandberg (4)
      Zygo Systems (1)     Genhold Femco (2)         Scotland Bancorp (1)     Genhold Femco (2)
Queries are generated by sampling sets of 1, 2, or 4
keywords from the headlines of the first thousand
documents in the corpus, then repeatedly probing the
Lucene index until the desired selectivity for the
query subclass is achieved. For the 4 keyword Or
queries, we were unable to generate queries of
selectivity 1 or 10, due to the union semantics
of disjunctive queries.
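The sample-and-probe procedure can be sketched as follows. This is a rough illustration only: a dictionary from term to document IDs stands in for the Lucene index, the acceptance window `lo`..`hi` is an assumed reading of "approximately", and every name in the block is hypothetical.

```python
import random


def count_hits(index, keywords, mode):
    """Count matching documents; `index` maps term -> set of docids."""
    postings = [index.get(k, set()) for k in keywords]
    if mode == "and":
        return len(set.intersection(*postings))
    return len(set.union(*postings))  # "or": union of the posting sets


def generate_query(vocabulary, index, size, mode, lo, hi, rng, max_tries=10000):
    """Sample keyword sets until one has between lo and hi hits."""
    for _ in range(max_tries):
        keywords = rng.sample(vocabulary, size)
        if lo <= count_hits(index, keywords, mode) <= hi:
            return keywords
    return None  # no query of the desired selectivity was found


toy_index = {"a": {1}, "b": {1, 2}, "c": {3}}
toy_vocab = ["a", "b", "c"]
and_query = generate_query(toy_vocab, toy_index, 2, "and", 1, 1, random.Random(0))
or_query = generate_query(toy_vocab, toy_index, 2, "or", 1, 1, random.Random(0),
                          max_tries=100)
```

In the toy index every two-keyword union covers at least two documents, so no Or query with exactly one hit exists and the search gives up, which is the same effect that ruled out low-selectivity 4 keyword Or queries.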
We want a uniform measure of keyword query
response time. Our standard is the execution time
from when the keywords are submitted by the
user to when all result document URLs have been
returned to the user. For the Lucene implementation,
we measure the time between when our Java Lucene
IndexSearcher instance receives the command line
keyword parameters and search option, to when it
completes retrieval of hits from the index. For our
three database implementations, we build an
embedded SQL application that takes in command
line keyword parameters and search options,
translates the query into appropriate SQL, connects to
the database, executes the query and retrieves
document URLs. The execution time of the embedded
SQL application is measured.
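A measurement loop in the spirit of this protocol might look like the sketch below. It is a simplification under stated assumptions: the paper times a whole command-line application, including the database connection, whereas this hypothetical `average_execution_time` helper only times a callable that returns an iterable of hits.

```python
import time


def average_execution_time(run_query, queries):
    """Return the mean wall-clock time in seconds over the given queries."""
    elapsed = []
    for query in queries:
        start = time.perf_counter()
        # Consume every hit so retrieval is included in the measured time.
        for _hit in run_query(query):
            pass
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)
```

Passing the three queries of one subclass and a function that executes a single query yields the per-class average of the kind reported in section 5.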
5. RESULTS & OBSERVATIONS
[The following abbreviations are used in the tables:
TIE: IBM DB2 Text Information Extender
TD: Term-Doc relational model
TDP: Term-Doc-Pos relational model ]
Table 3 And Queries Class Average Execution Time in
seconds
                   Lucene   TIE     TD      TDP
1 word - 1 hit     0.359    1.094   0.406   0.448
1 word - 10 hit    0.729    1.432   0.813   0.448
1 word - 100 hit   3.391    4.474   2.781   0.661
2 word - 1 hit     0.375    0.557   1.250   0.651
2 word - 10 hit    0.578    4.343   1.922   0.662
2 word - 100 hit   2.078    4.900   2.078   0.802
4 word - 1 hit     0.526    1.396   1.063   2.083
4 word - 10 hit    0.739    2.380   2.125   2.573
4 word - 100 hit   2.954    3.895   3.870   4.453
Table 4 Or Queries Class Average Execution Time in
seconds
                   Lucene   TIE     TD      TDP
1 word - 1 hit     0.359    1.094   0.406   0.448
1 word - 10 hit    0.729    1.432   0.813   0.448
1 word - 100 hit   3.391    4.474   2.781   0.661
2 word - 1 hit     0.385    0.343   0.442   112.7
2 word - 10 hit    0.719    0.446   0.469   113.5
2 word - 100 hit   2.500    0.501   0.693   115.9
4 word - 1 hit     -        -       -       -
4 word - 10 hit    -        -       -       -
4 word - 100 hit   3.016    0.922   0.805   120.5
Table 5 Phrase Queries Class Average Execution Time
in seconds
                   Lucene   TIE     TD      TDP
1 word - 1 hit     0.359    1.094   0.406   0.448
1 word - 10 hit    0.729    1.432   0.813   0.448
1 word - 100 hit   3.391    4.474   2.781   0.661
2 word - 1 hit     0.391    5.482   0.406   0.818
2 word - 10 hit    0.614    6.774   0.813   0.906
2 word - 100 hit   2.552    4.067   2.781   2.354
4 word - 1 hit     0.578    1.453   1.359   2.541
4 word - 10 hit    0.771    3.104   2.833   12.68
4 word - 100 hit   3.047    3.562   3.458   13.52
Table 6 Rank Queries Class Average Execution Time in
seconds
                   Lucene   TIE     TD      TDP
1 word - 1 hit     0.359    0.635   0.492   ∞*
1 word - 10 hit    0.729    1.099   0.526   ∞*
1 word - 100 hit   3.391    2.934   0.708   ∞*
2 word - 1 hit     0.385    0.339   13.55   257.6
2 word - 10 hit    0.719    0.370   13.47   284.6
2 word - 100 hit   2.500    0.432   16.19   261.4
4 word - 1 hit     -        -       -       -
4 word - 10 hit    -        -       -       -
4 word - 100 hit   3.016    0.523   15.06   307.4
* did not finish in 10 minutes
5.1 Lucene
Lucene, being a specialized IR system, performs
consistently well across the board on all four types of
queries. It has fast response time, and scales well over
the query size. However, performance deteriorates
slightly as the expected result size increases.
5.2 IBM DB2 Text Information Extender (TIE)
TIE produced performance numbers comparable to
that of Lucene on And queries, slightly better on Or
queries, and slightly worse on Phrase queries. We are
not able to make a direct comparison of Rank
query performance, as TIE returns only documents
containing all query terms whereas the three other
systems require only one keyword match. However,
judging from its performance on single-keyword
queries, its response time is comparable to that of
Lucene. Furthermore, TIE response time varies
significantly across different queries from the same
query class, a characteristic not seen in the other
systems.
5.3 Term-Doc
The system using the Term-Doc relational model
performed comparably with Lucene and TIE on And,
Or and Phrase queries. It also appears to scale well
(sub-linearly) on query and result size over those
same types of queries. It also performs competitively
on single-keyword ranked queries, but performance
degrades significantly on ranked queries with 2 or
more keywords. This is apparently due to the
optimizer choosing a different query plan on these
queries. However, performance still appears to scale
gracefully as query and result size increase.
5.4 Term-Doc-Pos
The system based on the Term-Doc-Pos relational
model produced reasonably fast response time on And
and Phrase queries, though efficiency deteriorates
notably with the increase in the expected number of
hits as well as the query size. The system was much
slower than the rest on Or and Rank type queries. In
fact, with single-keyword ranked queries, the system
does not respond within 10 minutes. Upon further
investigation, it was found that the query optimizer
was picking some unreasonably bad plans, involving
multiple sorts of large relations. We plan to look into
improvements in these areas in the future. Despite
these shortcomings, this approach seems to scale reasonably
well and is quite insensitive to the query or result
sizes.
5.5 Comparisons
[The following abbreviations are used in the charts:
DB2text: IBM DB2 Text Information Extender
DB2term_doc: Term-Doc relational model
DB2term_doc_pos: Term-Doc-Pos relational model ]
All four systems perform well on And queries (Figure
3), with response times no more than 5 seconds. It
should be noted that both the Term-Doc and
Term-Doc-Pos models produced numbers comparable to
that of Lucene. In fact, both appear to scale more
gracefully on increasing size of the result set.
Figure 3. And query with 2 keywords
With regard to Phrase queries, all four systems
finished within 7 seconds, with all except TIE
producing sub-3 second response times (Figure 4).
Note that both TIE and Term-Doc exhibit good
scaling characteristics, as compared to the Lucene and
Term-Doc-Pos systems.
Figure 4. Phrase query with 4 keywords
With Rank queries, we see where the two specialized
systems have their advantages (Figure 5, note the
logarithmic scale). Lucene and TIE significantly
outperform the two systems built on relational
models. One reason behind the disparity is that the
relational systems perform extensive sorting
operations on large relations. We also discovered that
the DB2 query optimizer could have picked better
plans if more indices/constraints were available, and
we plan to investigate this in the future.
Figure 5. Rank query with 2 keywords
6. CONCLUSIONS AND FUTURE WORK
The Term-Doc representation offers competitive
performance on Boolean and phrase queries compared
with the special-purpose IR system Lucene. Hence one
may choose to incur the storage overhead to gain the
advantages of a relational implementation, such as
portability, parallelism, and the ability to query over
both unstructured text and structured metadata. In
general, the Term-Doc representation offers better
performance than the Term-Doc-Position
representation. DB2 TIE provides performance
comparable to Lucene, though it incurs significantly
higher space overhead by storing the base text in a
relation, as well as the inverted index. If a workload
consists mostly of ranked queries, then Lucene or TIE
should be used, as the DB2 optimizer seems to be
using sub-optimal plans for the two relational
implementations.
There are a number of avenues for future work.
Additional query classes to be investigated include
proximity queries and wild-card queries. We
measured the query execution times in isolation. We
may want to measure executions of sustained query
workloads. Another important question is index
update performance. It is conceivable that RDBMS
page layout and B-tree index may be more efficient
for inverted index insertion than the traditional IR
approach of maintaining a small update list and
reorganizing the entire index periodically. Traditional
IR search engines are not well suited for high-insert
environments. We would like to find out if RDBMS
approaches are more attractive in such a setting. On
the performance front, a number of our RDBMS
queries were clearly being executed on sub-optimal
plans, hence some amount of database tuning may
result in a significant performance boost. Since most IR
systems are used interactively, and users typically
process result hits in batches, it is often useful to
optimize for top-K results. Most database vendors
provide language constructs to specify this constraint
and to utilize it in query processing for better
efficiency (e.g. reducing the size of sort results).
7. REFERENCES
[BCC94] E. W. Brown, J. P. Callan, and W. B. Croft. Fast
Incremental Indexing for Full-Text Information Retrieval.
In Proceedings of the 20th International Conference on
Very Large Databases, 1994.
[BR95] E. W. Brown. Execution Performance Issues in
Full-Text Information Retrieval. Ph.D. Thesis, University
of Massachusetts, Amherst, 1995.
[DDS95] S. DeFazio, A. Daoud, L. Smith, J. Srinivasan, B.
Croft, and J. Callan. Integrating IR and RDBMS using
cooperative indexing. In Proceedings of the 18th
International ACM SIGIR Conference on Research and
Development in Information Retrieval, 1995.
[DIX01] P. Dixon. Basics of Oracle Text Retrieval. IEEE
Data Engineering Bulletin, December 2001.
[GBS01] T. Grabs, K. Böhm, and H.-J. Schek. PowerDB-IR:
Information Retrieval on Top of a Database Cluster. In
Proceedings of 10th ACM International Conference on
Information and Knowledge Management, 2001.
[GFH97] D. A. Grossman, O. Frieder, D. O. Holmes, and
D. C. Roberts. Integrating structured data and text: A
relational approach. In Journal of the American Society for
Information Science, 1997.
[HN01] J. Hamilton, and T. Nayak. Microsoft SQL Server
Full-Text Search. IEEE Data Engineering Bulletin,
December 2001.
[KS95] H. Kaufmann, and H.-J. Schek. Text Search Using
Database Systems Revisited - Some Experiments. In
Proceedings of the 13th British National Conference on
Databases, 1995.
[LFH99] C. Lundquist, O. Frieder, D. O. Holmes, and D.
A. Grossman. A Parallel Relational Database Management
System Approach to Relevance Feedback in Information
Retrieval. In Journal of the American Society for
Information Science, 1999.
[LS88] C. A. Lynch, and M. Stonebraker. Extended
User-Defined Indexing with Application to Textual Databases. In
Proceedings of the 14th International Conference on Very
Large Databases, 1988.
[MS01] A. Maier, and D. Simmen. DB2 Optimization in
Support of Full Text Search. IEEE Data Engineering
Bulletin, December 2001.
[RAG01] P. Raghavan. Structured and Unstructured Search
in Enterprises. IEEE Data Engineering Bulletin, December
2001.