Keyword Search in Relational Databases
Jaehui Park
Intelligent Database Systems Lab.
Seoul National University
2009. 02. 12.
Outline
- Introduction
- Bibliography
- Fundamental Characteristics
- Research Dimensions
- Summary
- Future Directions
Copyright © 2009 by CEBT

Introduction
- Data
  - Structured: relational databases
    - A repository for a significant amount of data (e.g. enterprise data)
    - RDBMS: managing an abstract view of underlying data
  - Unstructured: (Web) documents
    - Collection of unstructured (natural language) documents available online
    - Search engine: in statistical and semantic ways
- Querying
  - Querying structured data: Structured Query Language (SQL)
    - Precise and complete
    - Difficult for casual users
  - Querying unstructured data: keyword search
    - The most popular application for information discovery
    - Simple and user-friendly
    - Approximating the precise results
- Deep Web
  - Information over the Web comes out of relational databases
[Figure labels: Precise; Easy; Easy way of querying structured data]

Introduction
- Enabling casual users to query relational databases with keywords
  - "Casual users"
    - Without any knowledge of the schema
    - Without any knowledge of the query language (SQL)
  - The search system should hold this knowledge on behalf of users
[Figure: keywords are translated into SQL, run against the relational databases, and results are returned]
- Challenges
  - Inherent discrepancy of data between IR and DB
    - Information is often split across tables (or tuples) in relational databases
    - Ex) A single retrieval unit of information

Bibliography
- Proximity
  - [Goldman et al., VLDB, 1998] Proximity Search in Databases
- DataSpot
  - [Palmon et al., VLDB, 1998] DTL's DataSpot - Database Exploration Using Plain Language
  - [Palmon et al., SIGMOD, 1998] DTL's DataSpot - database exploration as easy as browsing the Web
- DBXplorer
  - [Agrawal et al., ICDE, 2002] DBXplorer: a system for keyword-based search over relational databases
- BANKS
  - [Hulgeri et al., DEBU, 2001] Keyword Search in Databases
  - [Hulgeri et al., ICDE, 2002] Keyword Searching and Browsing in Databases using BANKS
  - [Kacholia et al., VLDB, 2005] Bidirectional Expansion For Keyword Search
- DISCOVER
  - [Hristidis et al., VLDB, 2002] DISCOVER: Keyword search in relational databases
  - [Hristidis et al., VLDB, 2003] Efficient IR-Style Keyword Search over Relational Databases
  - [Liu et al., SIGMOD, 2006] Effective Keyword Search in Relational Databases
- ObjectRank
  - [Balmin and Hristidis et al., VLDB, 2004] ObjectRank: Authority-Based Keyword Search in Databases
  - [Balmin and Hristidis et al., TODS, 2008] Authority-based search on databases

Proximity
- Proximity
  - Measure of how related objects are
  - Objects related by a distance function
    - Shortest path computation
    - K-neighborhood distance look-up table
[Figure labels: document; relational database]

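The look-up table idea can be sketched as follows. This is a minimal illustration, not the Proximity authors' implementation: it assumes an undirected, unweighted data graph and measures distance in hops, and the node names and the function `k_neighborhood_table` are hypothetical.

```python
from collections import deque

def k_neighborhood_table(graph, k):
    """Precompute hop distances for every pair of objects at most k hops apart."""
    table = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if dist[node] == k:
                continue  # do not expand beyond the K-neighborhood
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        for target, d in dist.items():
            if target != source:
                table[(source, target)] = d
    return table

# Hypothetical data graph: tuples linked by key references.
graph = {
    "author:JHPark": ["writes:1"],
    "writes:1": ["author:JHPark", "paper:J.H.Park08"],
    "paper:J.H.Park08": ["writes:1"],
}
table = k_neighborhood_table(graph, k=2)
```

At query time a distance is a dictionary look-up; pairs farther apart than K are simply absent from the table and would need a separate shortest path computation.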
DataSpot
- Hyperbase
  - Modeling data graph
  - Sub-hyperbase as an answer
- Best-first searching
[Figure: relational databases are converted into a hyperbase that is queried with keywords rather than with an SQL query. Example hyperbase nodes: Customers and Orders records; field name "Customer ID" with text "ID" and stems "customer" and "client" (via thesaurus); key field value 123456.]

DBXplorer
- Symbol table index for schema entities
  - Locating objects efficiently
    - Granularity
    - Compaction
[Figure: keyword query looked up in a symbol table (term, location) over the relational databases]
- Schema graph
  - Join tree enumeration
    - Joining several tables on the fly

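A symbol table with a configurable granularity might look like the sketch below. It is an assumption-laden illustration rather than DBXplorer's actual structure: the function `build_symbol_table`, the whitespace tokenization, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def build_symbol_table(database, granularity="column"):
    """Map each term to the schema locations where it occurs.

    granularity="column": (table, column); granularity="cell": (table, column, row index).
    """
    symbols = defaultdict(set)
    for table, rows in database.items():
        for row_index, row in enumerate(rows):
            for column, value in row.items():
                for term in str(value).lower().split():
                    if granularity == "column":
                        symbols[term].add((table, column))
                    else:
                        symbols[term].add((table, column, row_index))
    return symbols

# Hypothetical sample data loosely based on the Paper/Author example.
database = {
    "Author": [
        {"AuthorID": "JHPark", "AuthorName": "Jaehui Park"},
        {"AuthorID": "SGLee", "AuthorName": "Sang-goo Lee"},
    ],
    "Paper": [{"PaperID": "J.H.Park08", "PaperName": "Web Content Summarization"}],
}
symbols = build_symbol_table(database)
```

Column-level entries keep the table small and lean on the database's own column indexes to find rows; cell-level entries locate rows directly at the cost of a larger table.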
BANKS
- Directed (data) graph
  - Backward edge
  - Graph traversing algorithm
    - NP-hard problem
    - Heuristics
  - Backward Expanding search
  - Bi-directional expanding search
- Rich interface

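The idea behind backward expanding search can be sketched as below. This is a simplification, not the BANKS algorithm itself: BANKS interleaves one shortest-path iterator per keyword node and emits answers incrementally, whereas this sketch runs each backward Dijkstra to completion and returns only the cheapest root. The function names, edge weights, and node names are hypothetical.

```python
import heapq

def backward_distances(reverse_edges, start):
    """Dijkstra from a keyword node, following edges in reverse direction."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for prev, weight in reverse_edges.get(node, []):
            nd = d + weight
            if nd < dist.get(prev, float("inf")):
                dist[prev] = nd
                heapq.heappush(heap, (nd, prev))
    return dist

def best_root(edges, keyword_nodes):
    """Return (root, total cost) of the cheapest node that reaches every keyword node."""
    reverse_edges = {}
    for src, dst, weight in edges:
        reverse_edges.setdefault(dst, []).append((src, weight))
    per_keyword = [backward_distances(reverse_edges, n) for n in keyword_nodes]
    common = set(per_keyword[0]).intersection(*per_keyword[1:])
    if not common:
        return None
    root = min(common, key=lambda n: (sum(d[n] for d in per_keyword), n))
    return root, sum(d[root] for d in per_keyword)

# Hypothetical data graph: a Writes tuple references an Author and a Paper tuple.
edges = [
    ("writes:1", "author:JHPark", 1.0),
    ("writes:1", "paper:J.H.Park08", 1.0),
]
answer = best_root(edges, ["author:JHPark", "paper:J.H.Park08"])
```

The root together with its shortest paths to the keyword nodes forms one answer tree; its total cost approximates the Steiner tree weight that makes the exact problem NP-hard.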
DISCOVER
- High-level representation of the architecture for keyword search in relational databases
- Top-k join query processing
  - Pipeline algorithm
    - Threshold [Fagin et al. 2001]
- IR-style ranking function
  - TF-IDF based tuple ranking

ObjectRank
- Authority
  - Measure of how important objects are
    - Authority flow graph
  - Modified PageRank algorithm
    - (Global) ObjectRank algorithm
    - Inverse ObjectRank algorithm

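A keyword-specific, PageRank-style authority computation can be sketched as follows. It is a minimal sketch, assuming that each edge carries its own authority transfer rate and that random jumps land only on the base set (the objects containing the keyword); the function `object_rank`, the damping value, and the toy graph are hypothetical.

```python
def object_rank(edges, base_set, damping=0.85, iterations=50):
    """Power iteration over an authority flow graph.

    edges: (source, target, transfer_rate) triples; base_set: nodes matching the keyword.
    """
    nodes = set(base_set)
    for src, dst, _ in edges:
        nodes.update((src, dst))
    # Random jumps return only to nodes in the base set.
    jump = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * jump[n] for n in nodes}
        for src, dst, rate in edges:
            new_rank[dst] += damping * rate * rank[src]
        rank = new_rank
    return rank

# Hypothetical example: papers A and C contain the keyword and both cite paper B.
edges = [("paper:A", "paper:B", 0.7), ("paper:C", "paper:B", 0.7)]
rank = object_rank(edges, base_set={"paper:A", "paper:C"})
```

Paper B never mentions the keyword, yet it collects authority from the two matching papers that cite it, which is the behaviour the authority flow graph is meant to capture.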
Fundamental Characteristics
- Identifying schema elements
  - To avoid linearly scanning all the tables
  - Indexing structure
    - Inverted index
- Processing queries
  - Keyword query processing
    - Making the best of the lack of syntax in query keywords
  - Formalizing internal queries
    - e.g. SQL
- Modeling answers
  - Logical unit of retrieval is not a document
    - e.g. Directed Acyclic Graph (DAG)
- Ranking answers
  - Assign a single score, which can reflect the semantics of the underlying schema, to each answer
  - Order the returned answers
[Figure: keywords k1 to k4 enter the search system (Model, Indexing, Processing, Ranking), which accesses the RDB through the RDBMS]

Research Dimensions
- Model
  - Data Representation
  - Query Representation
- Efficient Processing
  - Processing: Top-k query processing
  - Indexing: Indexing structure
- Ranking
- Presentation

Data representation (1/4)
- Graph model
  - Data graph
  - Schema graph
[Figure: data graph example. Paper J.H.Park08 ("Web Content Summarization Using …") is linked through a Writes tuple to Author JHPark (Jaehui Park); the figure also shows Author SGLee (Sang-goo Lee) and Paper S.G.Lee08.]
[Figure: schema graph. Cites(Citing, Cited, …), Paper(PaperID, PaperName, …), Writes(AuthorID, PaperID, …), Author(AuthorID, AuthorName, …)]

Data representation (2/4)
- Data graph
  - Finding an optimal answer
    - NP-hard: Steiner tree problem
  - Efficient graph traversing
    - Reducing search time
    - Heuristics
  - Size problem
    - Too large to fit into main memory
  - Maintenance problem
    - Not appropriate for update-intensive databases
[Figure: keywords traverse the data graph built from the RDB]

Data representation (3/4)
- Schema graph
  - Smaller size
    - Scales well for huge databases
  - Utilize underlying RDBMS facilities
    - e.g. Database indexes on columns
  - Exploiting the schema of the underlying database
    - Generating optimal internal queries: SQL
    - Evaluation of queries
[Figure: keywords traverse the schema graph to produce a query that runs on the RDB]

Query keywords: Jaehui Relational Database
Candidate join queries:
Tmp1 : select * from Paper, Writes
where Paper.PaperName = 'Relational Database' AND …
Tmp2 : select * from Tmp1, Author
where … Author.AuthorName = 'Jaehui' AND …

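One candidate join query for the keywords above can be run end to end with the sketch below. It is only an illustration under stated assumptions: the sample rows and the helper `run_candidate` are hypothetical, the two temporary steps are folded into a single join along the schema path Paper, Writes, Author, and `LIKE` substring matching is swapped in for the equality predicates on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Paper  (PaperID TEXT PRIMARY KEY, PaperName TEXT);
CREATE TABLE Author (AuthorID TEXT PRIMARY KEY, AuthorName TEXT);
CREATE TABLE Writes (AuthorID TEXT, PaperID TEXT);
-- Hypothetical sample rows.
INSERT INTO Paper  VALUES ('P1', 'Keyword Search in Relational Database');
INSERT INTO Paper  VALUES ('P2', 'Web Content Summarization');
INSERT INTO Author VALUES ('JHPark', 'Jaehui Park');
INSERT INTO Author VALUES ('SGLee', 'Sang-goo Lee');
INSERT INTO Writes VALUES ('JHPark', 'P1');
INSERT INTO Writes VALUES ('SGLee', 'P2');
""")

# Candidate join query following the schema path Paper - Writes - Author.
CANDIDATE_SQL = """
SELECT Author.AuthorName, Paper.PaperName
FROM Paper
JOIN Writes ON Writes.PaperID = Paper.PaperID
JOIN Author ON Author.AuthorID = Writes.AuthorID
WHERE Paper.PaperName LIKE ? AND Author.AuthorName LIKE ?
"""

def run_candidate(connection, paper_keyword, author_keyword):
    """Evaluate the candidate query with substring matching on both keywords."""
    params = (f"%{paper_keyword}%", f"%{author_keyword}%")
    return connection.execute(CANDIDATE_SQL, params).fetchall()

rows = run_candidate(conn, "Relational Database", "Jaehui")
```

Because the work is expressed as SQL, the RDBMS optimizer and any column indexes do the heavy lifting, which is the main appeal of the schema-graph approach.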
Data representation (4/4)
- Graph model
  - A logical unit of information
    - Subgraph
      - A set of multiple nodes joined together
      - May include some tuples that do not contain any query keywords
  - Weighting scheme
    - Edges
      - Distance (or Proximity)
      - Join operations
    - Nodes
      - Importance (or Authority)
[Figure: tuples T1 to T6 containing keywords K1 to K3, with candidate answer subgraphs that join them]

Ranking
- Relevance
  - Answer size
    - Minimal subgraph including all the query keywords
    - Distance as the semantic closeness between objects
      - The distance between an entity and its attributes
      - The distance between tuples in the same table
      - The distance between tuples related through primary and foreign keys
  - Standard IR weighting method
    - TF-IDF
      - Term frequency
    - Text databases (e.g. user complaints, product descriptions, book reviews, etc.)
- Importance
  - Authority
    - Authority transfer graph
    - Nodes with incoming links from high-authority nodes are assumed to have higher importance
- Specificity problem
  - Specific results should be ranked higher than general ones
  - e.g., InverseObjectRank algorithm
[Figure: authority transfer graph over Cites(Citing, Cited, …), Paper(PaperID, PaperName, …), Writes(AuthorID, PaperID, …), Author(AuthorID, AuthorName, …); edge weights shown include 0.7, 0.2 and 0. A second example shows papers "Tree Traverse algorithm …" and "Query Evaluation …" and authors Jane and Tom with weights 0.8, 0.4 and 0.2.]

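TF-IDF based tuple ranking can be sketched as below. This is one common TF-IDF variant chosen for illustration, not the exact formula of any surveyed system: each tuple's text attribute is treated as a small document, and the function `tfidf_scores`, the smoothing `log(1 + N/df)`, and the sample texts are assumptions.

```python
import math
from collections import Counter

def tfidf_scores(tuples, query_terms):
    """Score each tuple's text against lowercase query terms with a simple TF-IDF."""
    docs = [Counter(text.lower().split()) for text in tuples]
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in docs if term in d)  # document frequency
            if df == 0 or term not in doc:
                continue
            tf = doc[term] / sum(doc.values())      # normalized term frequency
            idf = math.log(1 + n / df)              # rarer terms weigh more
            score += tf * idf
        scores.append(score)
    return scores

# Hypothetical text attribute values.
tuples = [
    "keyword search in relational databases",
    "relational query evaluation",
    "tree traverse algorithm",
]
scores = tfidf_scores(tuples, ["keyword", "relational"])
```

A joined answer could then combine the scores of its tuples, for example by summing them and penalizing larger answers, although the combination rule differs between systems.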
Efficient processing
- Indexing structure
  - Reducing scan time
    - Granularity levels of schema elements
      - Column level vs. Record (or Cell) level
  - Reducing computation time
    - Precomputation
      - Edge weights, node weights, relevance scores, etc.
- Query execution technique
  - Top-k query processing
    - Avoiding creating all query results
      - Decide which candidate answers will produce top-k results
      - e.g. Sparse algorithm, Pipeline algorithm

ROWID  Score        ROWID  Score
a1     76           b1     90
a2     60           b2     50
a3     15           b3     12
…      …            …      …

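Early termination over two score-sorted lists like the ones above can be sketched as follows. This is a threshold-style rank join written for illustration, not the Sparse or Pipeline algorithm as published: the combined score is assumed to be the sum of the two tuple scores, and the function `top_k_join` and the `joinable` pairs are hypothetical.

```python
def top_k_join(left, right, joinable, k):
    """left/right: (rowid, score) lists sorted by descending score.

    joinable: set of (left_rowid, right_rowid) pairs that satisfy the join.
    Returns the k best (combined_score, left_rowid, right_rowid) results.
    """
    seen_left, seen_right, results = [], [], []
    i = j = 0
    while i < len(left) or j < len(right):
        # Alternate between the inputs while both still have unread tuples.
        pull_left = j >= len(right) or (i < len(left) and i <= j)
        if pull_left:
            rowid, score = left[i]
            i += 1
            seen_left.append((rowid, score))
            results += [(score + s, rowid, r) for r, s in seen_right
                        if (rowid, r) in joinable]
        else:
            rowid, score = right[j]
            j += 1
            seen_right.append((rowid, score))
            results += [(s + score, l, rowid) for l, s in seen_left
                        if (l, rowid) in joinable]
        results.sort(reverse=True)
        # Upper bound on any result that involves a tuple not read yet.
        bounds = []
        if i < len(left) and right:
            bounds.append(left[i][1] + right[0][1])
        if j < len(right) and left:
            bounds.append(left[0][1] + right[j][1])
        threshold = max(bounds, default=float("-inf"))
        if len(results) >= k and results[k - 1][0] >= threshold:
            break  # no unseen combination can enter the top k
    return results[:k]

left = [("a1", 76), ("a2", 60), ("a3", 15)]
right = [("b1", 90), ("b2", 50), ("b3", 12)]
# Hypothetical join pairs.
joinable = {("a1", "b1"), ("a2", "b1"), ("a2", "b2"), ("a3", "b3")}
top2 = top_k_join(left, right, joinable, k=2)
```

With these inputs the top two results are found after reading only three of the six tuples, which is the saving that top-k processing aims for.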
Query representation
- Logical operators
  - Conjunction, disjunction
- Type and condition
  - Type
    - Find type, Near type
  - Conditional keywords
    - e.g. Year > 300

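Separating plain keywords from conditional keywords could be done along these lines. The query syntax is an assumption made for illustration (attribute, comparison operator, integer), as are the function `parse_query` and the sample query; the surveyed systems each define their own syntax.

```python
import re

# Assumed condition syntax: <attribute> <operator> <integer>, e.g. "Year > 300".
CONDITION = re.compile(r"(\w+)\s*(<=|>=|<|>|=)\s*(\d+)")

def parse_query(text):
    """Split a query string into plain keywords and (attribute, operator, value) conditions."""
    conditions = [(attr, op, int(value)) for attr, op, value in CONDITION.findall(text)]
    keywords = CONDITION.sub(" ", text).split()
    return keywords, conditions

keywords, conditions = parse_query("Jaehui database Year > 300")
```

The plain keywords would then be combined with a logical operator such as conjunction, while each condition becomes a predicate on the named attribute.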
Presentation
- Visualizing search results
  - e.g. Tree view
    - Structural level vs. tuple level
- Limiting the maximum size of an answer
- Limiting the maximum number of answers
- …

Summary
- Comparison in a common framework

System | Data model | Ranking | Efficiency | Query representation | Presentation
Proximity | Data-graph | Distance | K-neighborhood distance look-up | Type, Conjunction | -
DataSpot | Data-graph | Number of edges | - | Conjunction | Table
DBXplorer | Schema-graph | Number of joins | Symbol table | Conjunction | Enumerated rows
BANKS | Data-graph (directed) | Edge weight, Node weight | Disk resident index on keyword | Conjunction | Dynamic Joined Tree
DISCOVER | Schema-graph | Number of joins | Master Index | Conjunction, Disjunction | -
ObjectRank | Schema-graph, Data-graph | Authority | Master Index | Conjunction, Disjunction | -

Future Directions
- Probabilistic model
  - Naïve approaches
    - Rank measures based on the answer size
    - Cannot directly estimate the (probability of) relevance between the query and the retrieved tuples
    - Heuristics perform well
  - Probabilistic model
    - e.g. Bayesian belief network
    - Term-based approach to approximate the optimal answer
    - Modification for dealing with relational databases
      - Dependencies between schema elements
- Efficient query processing
  - Top-k query processing has shown a great impact on performance
    - Ranking function involves aggregation or grouping operators
    - Symbol table design
- Conclusion
  - Various approaches are described according to our understanding
  - We envision the above research directions to be important to pursue.

Thank you