Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by: Kiran Karnam

Introduction & Motivation

Problem Definition

Architecture

Ranking Function

Implementation

Experiments

Conclusions & Limitations



Many-answers problem
Two alternative solutions:
Query reformulation
Automatic ranking
Apply probabilistic models from IR to DB tuple ranking

Many-answers problem
SELECT * FROM REALTOR_DB
WHERE CITY='SEATTLE';

Query reformulation

Automatic ranking
Specified attributes: City
Unspecified attributes: View, School District, Boat Dock

Global score:
Captures the global importance of unspecified attribute values.
E.g., VIEW='WATERFRONT'

Conditional score:
Captures the strength of the dependencies (or correlations) between specified and unspecified attribute values.
E.g., the correlation between CITY='SEATTLE' and VIEW='WATERFRONT'

Required rules and theorems

Bayes' rule:
p(a | b) = p(b | a) p(a) / p(b)

Product rule:
p(a, b | c) = p(a | c) p(b | a, c)
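
Both rules can be checked numerically on a small example. The sketch below uses a made-up joint distribution over two binary events (the numbers are purely illustrative, not taken from the presentation) and exact fractions so that the equalities hold without rounding. The product rule is checked in its unconditional form p(a, b) = p(a) p(b | a).

```python
from fractions import Fraction

# Hypothetical joint distribution p(a, b) over two binary events.
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(1, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

def p_a(a):
    """Marginal probability of event a."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def p_b(b):
    """Marginal probability of event b."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def p_a_given_b(a, b):
    return joint[(a, b)] / p_b(b)

def p_b_given_a(b, a):
    return joint[(a, b)] / p_a(a)

# Bayes' rule: p(a | b) = p(b | a) p(a) / p(b)
bayes_lhs = p_a_given_b(True, True)
bayes_rhs = p_b_given_a(True, True) * p_a(True) / p_b(True)

# Product rule (unconditional form): p(a, b) = p(a) p(b | a)
product_rhs = p_a(True) * p_b_given_a(True, True)

print(bayes_lhs, bayes_rhs, product_rhs)
```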

Bayes' theorem relates two conditional probabilities that are the reverse of each other, p(a | b) and p(b | a).

The probability of an event A given an event B depends not only on the relationship between A and B but also on the marginal (or "simple") probability of each event.

Document (tuple) t, query Q
R: relevant documents
R̄ = D - R: irrelevant documents

Tuple t is considered as a document

Partition t into t(X) and t(Y)

t(X) and t(Y) are written as X and Y

Starting from an initial scoring function, derive step by step until the final ranking function is obtained

Given a query Q and a tuple t, the X (and Y)
values within themselves are assumed to be
independent, though dependencies between the
X and Y values are allowed
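
One possible route through this derivation is sketched below, using only Bayes' rule and the product rule. It is a reconstruction under stated assumptions, not necessarily the exact sequence of steps on the original slides: the irrelevant set is approximated by the whole database D, and the relevant set is estimated from a query workload W.

```latex
\begin{align*}
\mathrm{Score}(t)
  &\propto \frac{p(R \mid t)}{p(\bar{R} \mid t)}
   = \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
   && \text{Bayes' rule; } p(t) \text{ cancels} \\
  &\propto \frac{p(t \mid R)}{p(t \mid D)}
   && \text{assumption: } \bar{R} \approx D;\ p(R)/p(\bar{R}) \text{ is constant} \\
  &= \frac{p(X \mid R)}{p(X \mid D)} \cdot
     \frac{p(Y \mid X, R)}{p(Y \mid X, D)}
   && \text{product rule with } t = (X, Y) \\
  &\propto \frac{p(Y \mid X, R)}{p(Y \mid X, D)}
   && X \text{ is the same for every result tuple} \\
  &\approx \frac{p(Y \mid X, W)}{p(Y \mid X, D)}
   && \text{assumption: } R \text{ estimated from the workload } W \\
  &\propto \frac{p(Y \mid W)}{p(Y \mid D)} \cdot
     \frac{p(X \mid Y, W)}{p(X \mid Y, D)}
   && \text{Bayes' rule; } p(X \mid D)/p(X \mid W) \text{ is constant}
\end{align*}
```

Under the limited-independence assumption, each of the two factors is then approximated by a product over individual attribute values, giving a global part over the y values and a conditional part over the (x, y) pairs.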

If many queries specify a set X of conditions, then there is a preference correlation between the attributes in X.

Global: e.g., if many queries ask for waterfront, then p(Waterfront=TRUE) is high.

Conditional: e.g., if many queries ask for 4-bedroom houses in good school districts, then p(Bedrooms=4 | SchoolDistrict='good') and p(SchoolDistrict='good' | Bedrooms=4) are high.

The final ranking formula is

Score(t) ∝ ∏_{y ∈ Y} [ p(y | W) / p(y | D) ] × ∏_{y ∈ Y} ∏_{x ∈ X} [ p(x | y, W) / p(x | y, D) ]

where:
p(y | W) = relative frequency of unspecified attribute value y in the workload W
p(y | D) = relative frequency of unspecified attribute value y in the database D
p(x | y, W) = strength of the correlation between x and y in W
p(x | y, D) = strength of the correlation between x and y in D
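
As a minimal sketch of how these four quantities combine, the function below assumes the score is a global part (the product of p(y | W) / p(y | D) over the unspecified values y) multiplied by a conditional part (the product of p(x | y, W) / p(x | y, D) over all pairs of specified x and unspecified y). The function name, the dictionary layout, and all probability values are hypothetical and chosen only for illustration.

```python
from math import prod

def score(x_values, y_values, p_y_w, p_y_d, p_xy_w, p_xy_d):
    """Global part times conditional part for one tuple.

    x_values: specified (attribute, value) pairs of the query
    y_values: unspecified (attribute, value) pairs of the tuple
    p_y_w, p_y_d: dicts mapping y -> p(y | W) and p(y | D)
    p_xy_w, p_xy_d: dicts mapping (x, y) -> p(x | y, W) and p(x | y, D)
    """
    global_part = prod(p_y_w[y] / p_y_d[y] for y in y_values)
    conditional_part = prod(
        p_xy_w[(x, y)] / p_xy_d[(x, y)] for y in y_values for x in x_values
    )
    return global_part * conditional_part

# Illustrative values only.
SEATTLE = ("City", "Seattle")
WATERFRONT = ("View", "Waterfront")
STREET = ("View", "Street")

p_y_w = {WATERFRONT: 0.5, STREET: 0.25}
p_y_d = {WATERFRONT: 0.25, STREET: 0.5}
p_xy_w = {(SEATTLE, WATERFRONT): 0.8, (SEATTLE, STREET): 0.5}
p_xy_d = {(SEATTLE, WATERFRONT): 0.4, (SEATTLE, STREET): 0.5}

waterfront_score = score([SEATTLE], [WATERFRONT], p_y_w, p_y_d, p_xy_w, p_xy_d)
street_score = score([SEATTLE], [STREET], p_y_w, p_y_d, p_xy_w, p_xy_d)
print(waterfront_score, street_score)
```

With these numbers the waterfront home outranks the street-view home, because waterfront is requested more often in the workload than it occurs in the data.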


Pre-processing
◦ Atomic probabilities module
◦ Index module

Intermediate knowledge representation layer

Query processing
◦ Scan algorithm
◦ List merge algorithm

Pre-processing steps
◦ The atomic probabilities module computes p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for all distinct values of x and y.
◦ These atomic probabilities are stored as database tables in the intermediate knowledge representation layer, with appropriate indexes.
◦ The index module computes the conditional and global lists tables.
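
The atomic probabilities can be estimated by simple counting. The sketch below assumes relative-frequency estimates: the same function is applied once to the database tuples and once to the conditions of the workload queries. The function name, the record layout, and the toy data are hypothetical. Pairs that never co-occur are simply absent from the result, so a real system would need to decide how to treat zero counts.

```python
from collections import Counter
from itertools import permutations

def atomic_probabilities(records):
    """Estimate atomic probabilities by relative frequency.

    records: list of dicts mapping attribute -> value (the tuples of D,
             or the conditions of each query in W).
    Returns (p_single, p_cond) where
      p_single[v]    ~ p(v | records)
      p_cond[(x, y)] ~ p(x | y, records)
    """
    n = len(records)
    single = Counter()
    pair = Counter()
    for rec in records:
        items = list(rec.items())
        single.update(items)                  # count each (attribute, value)
        pair.update(permutations(items, 2))   # count each ordered co-occurrence
    p_single = {v: c / n for v, c in single.items()}
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p_single, p_cond

# Toy database and workload, for illustration only.
database = [
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle", "View": "Street"},
    {"City": "Seattle", "View": "Street"},
    {"City": "Bellevue", "View": "Street"},
]
workload = [
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle", "View": "Waterfront"},
    {"View": "Waterfront"},
    {"City": "Bellevue"},
]

p_d, c_d = atomic_probabilities(database)
p_w, c_w = atomic_probabilities(workload)

SEATTLE = ("City", "Seattle")
WATERFRONT = ("View", "Waterfront")
print(p_w[WATERFRONT], p_d[WATERFRONT])
print(c_w[(SEATTLE, WATERFRONT)], c_d[(SEATTLE, WATERFRONT)])
```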

CONDITIONAL LISTS Cx:
Contain <TID, CondScore> pairs in descending order of CondScore

GLOBAL LISTS Gx:
Contain <TID, GlobScore> pairs in descending order of GlobScore

Select Tuples that Satisfy the Query

Scan and Compute Score for Each Result-Tuple

Return Top-K Tuples
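
The three steps above can be sketched as follows. This is a simplified illustration, not the presenters' implementation: the per-value weights stand in for the full ranking function, and the row layout, names, and numbers are hypothetical.

```python
import heapq
from math import prod

def scan_top_k(rows, query, k, value_weight):
    """Select matching tuples, score each one, and return the top-k TIDs.

    rows: list of (tid, attrs) with attrs a dict attribute -> value
    query: dict of specified attribute -> required value
    value_weight: dict (attribute, value) -> weight of an unspecified value
                  (a stand-in for the real ranking function)
    """
    scored = []
    for tid, attrs in rows:
        # Step 1: keep only tuples that satisfy every query condition.
        if all(attrs.get(a) == v for a, v in query.items()):
            # Step 2: score the tuple from its unspecified attribute values.
            s = prod(
                value_weight.get((a, v), 1.0)
                for a, v in attrs.items()
                if a not in query
            )
            scored.append((s, tid))
    # Step 3: return the k highest-scoring tuples.
    return [tid for _, tid in heapq.nlargest(k, scored)]

homes = [
    (1, {"City": "Seattle", "View": "Waterfront"}),
    (2, {"City": "Seattle", "View": "Street"}),
    (3, {"City": "Bellevue", "View": "Waterfront"}),
    (4, {"City": "Seattle", "View": "Greenbelt"}),
]
weights = {
    ("View", "Waterfront"): 4.0,
    ("View", "Greenbelt"): 1.5,
    ("View", "Street"): 0.5,
}

top2 = scan_top_k(homes, {"City": "Seattle"}, 2, weights)
print(top2)
```

Every matching tuple is scored before any result is returned, so the cost grows with the size of the answer set.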

The scan algorithm is inefficient when the answer set contains many tuples

Another approach
Pre-compute top-K tuples for all possible queries
Still infeasible in practice

Trade-off solution
Pre-compute ranked lists of tuples for all possible atomic queries
At query time, merge ranked lists to get top-K tuples
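
The slides do not spell out the merge procedure, so the sketch below is an assumption: a threshold-style merge in the spirit of Fagin's Threshold Algorithm, with one pre-computed ranked list per atomic condition and the product of the per-list scores as the tuple score. Tuples missing from any list do not satisfy all conditions and are skipped. The function name and the list contents are hypothetical.

```python
import heapq
from math import prod

def list_merge_top_k(ranked_lists, k):
    """Merge ranked lists of (tid, score), each sorted by descending score.

    Scores are assumed non-negative, so the product is monotone and the
    product of the scores at the current depth bounds any unseen tuple.
    """
    lookups = [dict(lst) for lst in ranked_lists]  # random access by TID
    seen = set()
    top = []  # min-heap of (score, tid) holding at most k entries
    for depth in range(min(len(lst) for lst in ranked_lists)):
        for lst in ranked_lists:
            tid = lst[depth][0]  # sorted access
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in lk for lk in lookups):
                total = prod(lk[tid] for lk in lookups)
                if len(top) < k:
                    heapq.heappush(top, (total, tid))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, tid))
        # Upper bound on the score of any tuple not yet seen.
        threshold = prod(lst[depth][1] for lst in ranked_lists)
        if len(top) == k and top[0][0] >= threshold:
            break  # no unseen tuple can enter the top-k
    return [tid for _, tid in sorted(top, reverse=True)]

# Hypothetical ranked lists for two atomic conditions.
list_a = [(7, 5.0), (1, 4.0), (4, 2.0), (2, 1.0)]
list_b = [(4, 3.0), (1, 2.0), (5, 1.0), (2, 0.5)]

merged_top2 = list_merge_top_k([list_a, list_b], 2)
print(merged_top2)
```

In this example the loop stops after reading three entries from each list, without touching the last entry of either one, which is the saving the trade-off aims for.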

Databases Used
◦ MSN Home Advisor database
(http://houseandhome.msn.com/)
◦ Internet Movie Database

Software and Hardware:
• Microsoft SQL Server 2000 RDBMS
• P4 2.8-GHz PC, 1 GB RAM
• C#, connected to the RDBMS through DAO

Quality Experiments

Performance Experiments
Query: select * from SeattleHomes where City='Seattle' and Bedroom=1;
◦ Conditional ranked condos with garages highest
◦ Global failed to recognize the importance of the unspecified attribute Garage='Y'


User preference of rankings


5 new queries
Users were given the top-5 results

Compare 2 algorithms
◦ Scan algorithm
◦ List Merge algorithm

Execution time of performance algorithms

A completely automated approach to the many-answers problem that leverages data and workload statistics and correlations

Limitation:
Existence of correlations between text and non-text data.

Future work:
◦ Empty-answer problem
◦ Handle plain-text attributes




Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum,
Probabilistic Ranking of Database Query Results, VLDB 2004.
users.cs.fiu.edu/~vagelis/presentations/ProbRanking.ppt
http://crystal.uta.edu/~cse6339/Fall08DBIR.htm
http://crystal.uta.edu/~cse6339/Fall09DBIR.htm