Download Effective Keyword-Based Selection of Relational Databases

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

DBase wikipedia , lookup

Encyclopedia of World Problems and Human Potential wikipedia , lookup

Commitment ordering wikipedia , lookup

Global serializability wikipedia , lookup

Microsoft Access wikipedia , lookup

IMDb wikipedia , lookup

Serializability wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

PL/SQL wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Microsoft SQL Server wikipedia , lookup

Oracle Database wikipedia , lookup

SQL wikipedia , lookup

Open Database Connectivity wikipedia , lookup

Functional Database Model wikipedia , lookup

Ingres (database) wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Concurrency control wikipedia , lookup

Database wikipedia , lookup

Relational algebra wikipedia , lookup

Versant Object Database wikipedia , lookup

ContactPoint wikipedia , lookup

Clusterpoint wikipedia , lookup

Relational model wikipedia , lookup

Database model wikipedia , lookup

Transcript
Effective Keyword-Based
Selection of Relational
Databases
By Bei Yu, Guoliang Li, Karen Sollins &
Anthony K. H. Tung
Presented by Deborah Kallina
Introduction

The database selection problem for relational
data sources


Demand for keyword-based searches of
structured databases
What about searches over multiple structured
databases?
Current Approach


Document collections are summarized with keyword
frequency lists
Inadequate because…


Relational tables are typically normalized
Results must consider number of join operations
Example:
query Q = {multimedia,
database, VLDB}
The Usefulness of a Database in
Answering a Keyword Query


A database is useful if it has high quality
results to the keyword query.
What are “high quality” results?


The results contain all the query keywords.
There is a keyword relationship – the query
keywords can be connected meaningfully in the
database.
This paper proposes…


A method that summaries the relationships
between keywords in a database.
A ranking method based on the keyword
summaries in order to select the most useful
databases for a given keyword query.
Why Summaries?

We have an equation which measures the
usefulness of a database DB to query Q

But calculating this score isn’t feasible for every
database in the system!
So we estimate the usefulness of the database
using a summary of the relationships between all
pairs of keywords in the database.

Results of a Keyword Query

A minimal tree of tuples containing
all the query keywords
* tuples in the tree are connected
according to their relationships
defined in the database schema


The more distant the relationships,
the weaker the relationships
connecting the tuples.
distance - the number of join
operations in a joining sequence of
tuples
Examples:



t1
t2
t3, distance = 2
t4, distance = 0
DB2 of Figure 1,
Q = {multimedia, binder}


t1
t5
t12 or t4
t9
t12,
distance = 2
t15
t10
t1
t5
t12,
distance = 4
The Matrices

Each relational database DB has several types of matrices



one D matrix – represents the presence or absence of each
keyword in each tuple in DB
multiple T matrices – represents the relationship between tuples
in a relational database at each distance
construct the Wd matrices – the frequencies of keyword pairs at
distance d; Example: Figure 3
Calculate the Key Relationship Matrix (KRM)


wd(ki, kj) is the frequency of d-distance joining sequences to
connect the pair of keywords ki and kj
is the maximum number of join operations (user-specified)
Example: Figure 4
Q = {database:VLDB} for DB2
in Figure 3, the only non-zero is when d = 2
1/(2+1) * 1 occurrence = 0.33
Implementation in SQL



The generation of the matrices: D, T1, T2, … T and
W0, W1, … , W , for each DB, can be performed
conveniently inside the local RDBMS using SQL.
Tables are created for each matrix.
Tables can be maintained dynamically when tuples
are added to the database or old tuples are deleted.
Data Sources Selection Using KRM


Estimate the score of the data source DB to
Q through the scores of the individual pairs.
With the KR-summary, we can effectively
rank a set of databases. A higher score of the
data source DB to Q indicates a better
ranking.
Definitions & Metrics for Comparing
Experimental Results



real ranking – the “real” score of each database
based on Equation 2-1
recall – compares the accumulated score of the top
l databases selected based on the summaries of
the source databases against the total available
score when we select top l databases according to
real ranking (summaries vs. real ranking)
precision – measures the fraction of the top l
selected databases which have high quality results
Experimental Results
Observations:

The selection performance
of KR-summaries generally
gets better as gets larger.

The precision and recall
performance for different
values of tends to cluster
into groups.

There are big gaps in both
precisions and recalls
between KR-summaries
when 0 ≤ ≤ 1 and when
is greater.
Conclusion
“The estimation method is effective in
assigning high ranks to useful databases,
although less relevant or irrelevant databases
might be selected.”