Download ppt slides

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Microsoft Access wikipedia , lookup

Database wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Clusterpoint wikipedia , lookup

Versant Object Database wikipedia , lookup

Database model wikipedia , lookup

Relational algebra wikipedia , lookup

Relational model wikipedia , lookup

Transcript
Automated Ranking of
Database Query Results
Sanjay Agarwal, Surajit Chaudhuri, Gautam Das,
Aristides Gionis
Presented by
Mahadevkirthi Mahadevraj
Sameer Gupta
Contents






Introduction
Different ranking functions
Breaking ties
Implementation
Experiments
Conclusion
Introduction


Automated ranking of the results of the query is popular
aspect of IR.
Database system support only a boolean query model.
 Empty answers
- Too selective queries.
 Many answers
- Too broad queries.
Automated Ranking functions for the ‘Empty
Answers Problem’

IDF Similarity
- Mimics the TF-IDF concept for Heterogeneous data.

QF Similarity
- Utilizes workload information.

QFIDF Similarity
- Combination of QF and IDF.
Inverse Document Frequency (IDF)

IR technique
Cosine similarity between Q and D is normalized dot product of these two vectors.
where
Q = set of key words
D = set of documents
IDF(Inverse Document Frequency) is defined as IDF(w)=log(N/F(w))
where
N = number of documents.
F(w) = number of documents in which ‘w’ appears.
IDF implies most commonly occurring words convey less information
The Cosine similarity may be further refined by scaling each component with the IDF of corresponding
word
IDF Similarity
Database(only categorical attribute)
T=<t1,……tm>
Q=<q1,…...qm> Condition of the form “WHERE A1=q1 AND … AND Am=qm “
IDFk(t)=log(n/Fk(t))
n-number of tuples in database
Fk(t) -Frequency of tuples in database where Ak=t
For a pair of values ‘u’ and ‘v’ in Ak, Sk (u,v) is defined as
a1
t1
IDFk (u) if u =v and 0 otherwise . Similarity between T and Q : :
m
SIM (T , Q)   S k (t k , q )
k 1
k
t3
Sum of corresponding similarity coefficients over all attributes: :
•TF is irrelavant
Similarity function known as IDF similarity
a2
:: ::
a4
IDF Similarity Example
Consider a query
Select car from automobile_database
Where type=“convertible” and manufacturer=“Nissan”;
- Convertible is rare and hence IDF is high.
Numerical V/s Categorical Data
Consider a query
Select house from house_database
Where price=$300,000 and no_bedrooms=10;
Numeric IDF Similarity Example
Consider a query :
Select house from house_database where no_of_rooms=4
Here v=4

Attribute b1
No of
Rooms(u)
Diff ( |u-v|+1)
Sim= 1/Diff
Output Attr
B1
5
2
0.5
B4
B2
11
6
0.166667
B1
B3
9
4
0.25
B3
B4
4
1
1
B2
Generalizations of IDF similarity

For numeric data

Inappropriate to use previous categorical similarity coefficients.

frequency of numeric value depends on nearby values.

Discretizing numeric to categorical attribute is problematic.
 Solution:

{t1,t2…..tn} be the values of attribute A.For every value t,
sum of”contributions” of t from every other point ti
contributions modeled as gaussian distribution

Similarity function is
bandwidth parameter
Other Generalisations
Let query ‘q’ have a C – condition where C is generalized
as “A in Q” where Q is a set of values for categorical
attributes or a range [lb,ub] for numerical attributes.
The generalized similarity function is as shown :
Problems with IDF Similarity
Problem :
In a realtor database, more homes are built in
recent years such as 2007 and 2008 as compared to 1980
and1981.Thus recent years have small IDF. Yet newer
homes have higher demand.
Solution : QF Similarity.
QF Similarity : leveraging workloads

Importance of attribute values is determined by frequency
of their occurrence in workload.

In the example above , it is reasonable to assume that more
queries are requesting for newer homes than for older
homes. Thus the frequency of the year 2008 appearing in
the workload will be more than that of year 1981.
QF Similarity

For categorical data

query frequency QF(q)=
raw frequency of occurrence of value q of attribute A in query strings of workload (RQF(q)
_______________________________________________________________
raw frequency of most frequently occuring value in workload (RQFMax)
s(t,q)= QF(q), if q=t
0
, otherwise
QF similarity example
Consider a workload where attribute
A= { 1,1,2,3,4,5,5,5,5,2}
If now a query requests for A=1, then
QF (1) = RQF(1)/RQFMax = 2/4 .
If a query requests for an attribute value not in the
workload, then QF=0.
QF similarity : Different Attributes


Similarity between pairs of different categorical attribute
values can also be derived from workload
eg. To find S(TOYOTA CAMRY,HONDA ACCORD)
The similarity coefficient between tuple and query in this
case is defined by jaccard coefficient scaled by QF factor
as shown below.
S(t,q)=J(W(t),W(q))QF(q)
Analyzing workloads
 Analyzing
IN clauses of queries:
If certain pair of values often occur together in the
workload ,they are similar .e.g. queries with C as
“MFR IN {TOYOTA,HONDA,NISSAN}”
 Several
recent queries in workload by a specific user
repeatedly requesting for TOYOTA and HONDA.
 Numerical
values that occur in the workload can also
benefit from query frequency analysis.
QFIDF Similarity

QF is purely workload-based. Big disadvantage for insufficient or
unreliable workloads.

For QFIDF Similarity

S(t,q)=QF(q) *IDF(q) when t=q

where QF(q)=(RQF(q)+1)/(RQFMax+1).
Thus we get small non zero value even if value is never referenced in
workload model
Breaking ties using QF


Problem: Many tuples may tie for the same similarity score and get
ordered arbitarily.Arise in empty and many answers problem.
Solution: Determine the weights of missing attribute values that
reflect their “global importance” for ranking purposes by using
workload information.

Extend QF similarity ,use quantity
to break ties.
Consider a query requesting for 4 bedroom houses .
- Arlington is less important than dallas .
Problems with Breaking ties using IDF
large IDF scenario for missing attributes.
- Arlington homes are given more preference than dallas
homes since Arlington has a higher IDF, but this scenario is not true
in real practice.
Small IDF scenario for missing attributes.
- Consider homes with decks , but since we are considering smaller
IDF preference will be given to homes without decks since they have
a smaller IDF which is not true in real practice.
Implementation
Pre-processing component
 Query–processing component

Pre-processing component

Compute and store a representation of similarity function in auxiliary
database tables.

For categorical data, compute IDF(t) (resp QF(t)) ,to compute frequency
of occurences of values in database and store the results in auxillary
database tables.
 For numeric data, an approximate representation of smooth function
IDF() (resp(QF()) is stored, so that function value is retrieved at runtime.
Query processing component

Main task: Given a query Q and an integer K, retrieve Top-K tuples
from the database using one of the ranking functions.

Ranking function is extracted in pre-processing phase.
 SQL-DBMS functionality used for solving top-K problem.

Handling simpler query processing problem

Input: table R with M categorical columns, Key column TID, C is
conjunction of form Ak=qk..... and integer K.
 Output: top-K tuples of R similar to Q.
 Similarity function: Overlap Similarity.
Implementation of Top-K operator

Traditional approach ?

Indexed based approach



overlap similarity function satisfies the following monotonic property. Adapt Fagin’s TA
algorithm. If T and U are two tuples such that for all K, Sk(tk,qk)< Sk(uk,qk)
then SIM(T,Q) < SIM(U,Q)
To adapt TA implement Sorted and random access methods.
Performs sorted access for each attribute, retrieve complete tuples with corresponding TID by
random access and maintains buffer of Top-K tuples seen so far.
Threshold Algorithm (TA)
Read all grades of an object once seen from a sorted access
• No need to wait until the lists give k common objects
Do sorted access (and corresponding random accesses) until you
have seen the top k answers.
• How do we know that grades of seen objects are higher than
the grades of unseen objects ?
• Predict maximum possible grade unseen objects:
L2
L1
Seen
Possibly unseen
a: 0.9
d: 0.9
b: 0.8
a: 0.85
c: 0.72
.
.
.
f: 0.65
.
d: 0.6
b: 0.7
.
f: 0.6
.
.
.
c: 0.2
T = min(0.72, 0.7) = 0.7
Threshold value
Example – Threshold Algorithm
L1
(a, 0.9)
(b, 0.8)
Step 1: - parallel sorted access to each list
For each object seen:
- get all grades by random access
- determine Min(A1,A2)
- amongst 2 highest seen ? keep in buffer
L2
(d, 0.9)
ID
A1
A2
Min(A1,A2)
a
0.9
0.85
0.85
d
0.6
0.9
0.6
(a, 0.85)
(c, 0.72)
(b, 0.7)
.
.
.
.
.
.
.
.
(d, 0.6)
(c, 0.2)
Example – Threshold Algorithm
Step 2: - Determine threshold value based on objects currently
seen under sorted access. T = min(L1, L2)
- 2 objects with overall grade ≥ threshold value ? stop
else go to next entry position in sorted list and repeat step 1
L1
L2
a: 0.9
d: 0.9
b: 0.8
ID
A1
A2
Min(A1,A2)
a
0.9
0.85
0.85
d
0.6
0.9
0.6
a: 0.85
c: 0.72
b: 0.7
.
.
.
.
.
.
.
.
d: 0.6
c: 0.2
T = min(0.9, 0.9) = 0.9
Example – Threshold Algorithm
Step 1 (Again): - parallel sorted access to each list
For each object seen:
- get all grades by random access
- determine Min(A1,A2)
- amongst 2 highest seen ? keep in buffer
L2
L1
(a, 0.9)
(d, 0.9)
(b, 0.8)
(a, 0.85)
(c, 0.72)
(b, 0.7)
.
.
.
.
.
.
.
.
(d, 0.6)
(c, 0.2)
ID
A1
A2
Min(A1,A2)
a
0.9
0.85
0.85
d
0.6
0.9
0.6
b
0.8
0.7
0.7
Example – Threshold Algorithm
Step 2 (Again): - Determine threshold value based on objects currently
seen. T = min(L1, L2)
- 2 objects with overall grade ≥ threshold value ? stop
else go to next entry position in sorted list and repeat step 1
L1
L2
a: 0.9
d: 0.9
b: 0.8
ID
A1
A2
Min(A1,A2)
a
0.9
0.85
0.85
b
0.8
0.7
0.7
a: 0.85
c: 0.72
b: 0.7
.
.
.
.
.
.
.
.
d: 0.6
c: 0.2
T = min(0.8, 0.85) = 0.8
Example – Threshold Algorithm
Situation at stopping condition
L1
L2
a: 0.9
d: 0.9
b: 0.8
ID
A1
A2
Min(A1,A2)
a
0.9
0.85
0.85
b
0.8
0.7
0.7
a: 0.85
c: 0.72
b: 0.7
.
.
.
.
.
.
.
.
d: 0.6
c: 0.2
T = min(0.72, 0.7) = 0.7
Indexed-based TA(ITA)
Sorted access
Random
access
Indexed-based TA(ITA)
Stopping Condition
Hypothetical tuple – current value a1,…, ap for A1,… Ap,
corresponding to index seeks on L1,…, Lp and qp+1,…..
qm for remaining columns from the query directly.
Termination – Similarity of hypothetical tuple to the
query< tuple in Top-k buffer with least similarity.
ITA for Numeric columns
Consider a query has condition Ak = qk for a numeric column
Ak.
Two index scan is performed on Ak.
- First retrieve TID’s > qk in incresing order.
- Second retrieve TID’s < qk in decreasing order.
We then pick TID’s from the merged stream.
ITA can be extended for IN,range queries . Additional
challenges arise for ranking function over set of tables.
Experiments
Ranking functions
quality v/s workload
Experiments
Varying no of attributes
Varying K in Top-K
Conclusion

Automated Ranking Infrastructure for SQL databases.

Extended TF-IDF based techniques from Information
retrieval to numeric and mixed data.

Implementation of Ranking function that exploited
indexed access (Fagin’s TA)