Issues in Bridging DB & IR
11/21
Administrivia
• Homework 4 socket open
  – *PLEASE* start working. There may not be an extra week before submission.
  – Considering making Homework 4 subsume the second exam—okay?
• Topics coming up
  – DB/IR (1.5 classes); Collection Selection (.5 classes)
  – Social Network Analysis (1 class); Web services (1 class)
  – Interactive review/Summary (last class)
DB and IR: Two Parallel Universes
• canonical application: accounting (DB) vs. libraries (IR)
• data type: numbers, short strings (DB) vs. text (IR)
• foundation: algebraic / logic based (DB) vs. probabilistic / statistics based (IR)
• search paradigm: Boolean retrieval with exact queries and result sets/bags (DB) vs. ranked retrieval with vague queries and result lists (IR)
parallel universes forever?
CIDR 2005
DB vs. IR
Databases:
• DBs allow structured querying
• Queries and results (tuples) are different objects
• Soundness & completeness expected
• User is expected to know what she is doing
IR:
• IR only supports unstructured querying
• Queries and results are both documents!
• High precision & recall is hoped for
• User is expected to be a dunderhead.
Top-down Motivation: Applications (1) - Customer Support
Typical data:
• Customers (CId, Name, Address, Area, Category, Priority, ...)
• Requests (RId, CId, Date, Product, ProblemType, Body, RPriority, WFId, ...)
• Answers (AId, RId, Date, Class, Body, WFId, WFStatus, ...)
Why customizable scoring?
• wealth of different apps within this app class
• different customer classes
• adjustment to evolving business needs
• scoring on text + structured data (weighted sums, language models, skyline, w/ correlations, etc. etc.)
Typical queries:
a premium customer from Germany: "A notebook, model ... configured with ..., has a problem with the driver of its Wave-LAN card. I already tried the fix ..., but received error message ..."
→ request classification & routing
→ find similar requests
Platform desiderata (from the app developer's viewpoint):
• Flexible ranking and scoring on text, categorical, numerical attributes
• Incorporation of dimension hierarchies for products, locations, etc.
• Efficient execution of complex queries over text and data attributes
• Support for high update rates concurrently with high query load
CIDR 2005
Top-down Motivation: Applications (2)
More application classes:
• Global health-care management for monitoring epidemics
• News archives for journalists, press agencies, etc.
• Product catalogs for houses, cars, vacation places, etc.
• Customer relationship management in banks, insurance companies, telecoms, etc.
• Bulletin boards for social communities
• P2P personalized & collaborative Web search
etc. etc.
CIDR 2005
Top-down Motivation: Applications (3)
Next wave Text2Data:
use Information-Extraction technology (regular expressions, HMMs, lexicons, other NLP and ML techniques) to convert text docs into relational facts, moving up in the value chain.
Example:
"The CIDR'05 conference takes place in Asilomar from Jan 4 to Jan 7, and is organized by D.J. DeWitt, Mike Stonebreaker, ..."
Extracted relations:
Conference(Name, Year, Location, Date, Prob):
  CIDR, 2005, Asilomar, 05/01/04, 0.95
ConfOrganization(Name, Year, Chair, Prob):
  CIDR, 2005, P68, 0.9
  CIDR, 2005, P35, 0.75
People(Id, Name):
  P35, Michael Stonebraker
  P68, David J. DeWitt
• facts now have confidence scores
• queries involve probabilistic inferences and result ranking
• relevant for "business intelligence"
CIDR 2005
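Below is a minimal sketch of the Text2Data step, assuming a single hand-written regular expression stands in for the HMM/lexicon-based extractors mentioned above; the pattern, field names, and the fixed 0.95 confidence are illustrative assumptions, not the CIDR example's actual extractor.

```python
import re

# Minimal Text2Data sketch: a hand-written pattern turns one sentence into a
# relational fact with an (assumed) confidence score. Real extractors would use
# HMMs/CRFs, lexicons, and calibrated probabilities instead of a fixed 0.95.
PATTERN = re.compile(
    r"The (?P<name>\w+)'(?P<year>\d{2}) conference takes place in "
    r"(?P<location>\w+) from (?P<start>\w+ \d+) to (?P<end>\w+ \d+)"
)

def extract_conference(sentence):
    m = PATTERN.search(sentence)
    if not m:
        return None
    return {
        "Name": m.group("name"),
        "Year": 2000 + int(m.group("year")),
        "Location": m.group("location"),
        "Date": m.group("start"),
        "Prob": 0.95,          # confidence attached to the extracted fact
    }

text = ("The CIDR'05 conference takes place in Asilomar from Jan 4 to Jan 7, "
        "and is organized by D.J. DeWitt, Mike Stonebreaker, ...")
print(extract_conference(text))
# {'Name': 'CIDR', 'Year': 2005, 'Location': 'Asilomar', 'Date': 'Jan 4', 'Prob': 0.95}
```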
Some specific problems
1. How to handle textual attributes in data processing (e.g. joins)?
2. How to support keyword-based querying over normalized relations?
3. How to handle imprecise queries? (Ullas Nambiar's work)
4. How to do query processing for top-K results? (Surajit et al. paper in CIDR 2005)
1. Handling text fields in data tuples
• Often you have database relations some of whose fields are "textual"
  – E.g. a movie database, which has, in addition to year, director etc., a column called "Review" which is unstructured text
• Normal DB operations ignore this unstructured stuff (can't join over them).
  – SQL sometimes supports a "Contains" constraint (e.g. give me movies that contain "Rotten" in the review)
STIR (Simple Text in Relations)
• The elements of a tuple are seen as documents (rather than atoms)
• Query language is the same as SQL save a "similarity" predicate
Soft Joins.. WHIRL [Cohen]
• We can extend the notion of joins to "similarity joins", where similarity is measured in terms of vector similarity over the text attributes. The join tuples are output in ranked form, with rank proportional to the similarity.
• Neat idea… but it does have some implementation difficulties:
  – Most tuples in the cross-product will have non-zero similarities, so we need query processing that will somehow produce just the highly ranked tuples.
  – Uses A*-search to focus on top-K answers.
  – (See Surajit et al., CIDR 2005, who argue for a whole new query algebra to help support top-K query processing.)
A minimal soft-join sketch follows.
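The sketch below assumes hand-rolled TF-IDF vectors and a brute-force scan of the cross product over toy data; WHIRL's actual A*-based top-K search and its exact ranking details are not reproduced here.

```python
import math
from collections import Counter

# Sketch of a WHIRL-style "soft join": rank pairs from two relations by the
# TF-IDF cosine similarity of a text attribute. WHIRL itself uses an A*-style
# search to produce only the top-K pairs; brute-forcing the cross product, as
# here, is fine only for tiny relations.
reviews  = ["The Hitchhiker's Guide to the Galaxy, 2005",
            "Men in Black, 1997",
            "Space Balls, 1987"]
listings = ["Hitchhiker's Guide to the Galaxy",
            "Star Wars Episode III",
            "Cinderella Man"]

def tokenize(s):
    return s.lower().replace(",", " ").replace("'", " ").split()

corpus = [tokenize(t) for t in reviews + listings]
df = Counter(w for doc in corpus for w in set(doc))     # document frequencies
N = len(corpus)

def tfidf(tokens):
    tf = Counter(tokens)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    if not dot:
        return 0.0
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

pairs = [(cosine(tfidf(tokenize(r)), tfidf(tokenize(l))), r, l)
         for r in reviews for l in listings]
for score, r, l in sorted(pairs, reverse=True)[:3]:     # top-ranked join tuples
    print(f"{score:.2f}  {r}  ~  {l}")
```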
WHIRL queries
• Assume two relations:
  review(movieTitle, reviewText): archive of reviews
  listing(theatre, movieTitle, showTimes, …): now showing

review:
  The Hitchhiker's Guide to the Galaxy, 2005 | "This is a faithful re-creation of the original radio series – not surprisingly, as Adams wrote the screenplay …"
  Men in Black, 1997 | "Will Smith does an excellent job in this …"
  Space Balls, 1987 | "Only a die-hard Mel Brooks fan could claim to enjoy …"
  …

listing:
  Star Wars Episode III | The Senator Theater | 1:00, 4:15, & 7:30pm
  Cinderella Man | The Rotunda Cinema | 1:00, 4:30, & 7:30pm
  …
WHIRL queries
• "Find reviews of sci-fi comedies" [movie domain]
  FROM review SELECT * WHERE r.text~'sci fi comedy'
  (like standard ranked retrieval of "sci-fi comedy")
• "Where is [that sci-fi comedy] playing?"
  FROM review as r, LISTING as s SELECT *
  WHERE r.title~s.title and r.text~'sci fi comedy'
  (best answers: titles are similar to each other, e.g. "Hitchhiker's Guide to the Galaxy" and "The Hitchhiker's Guide to the Galaxy, 2005", and the review text is similar to "sci-fi comedy")
WHIRL queries
• Similarity is based on TFIDF: rare words are most important.
• Search for high-ranking answers uses inverted indices…
  – It is easy to find the (few) items that match on "important" terms
  – Search for strong matches can prune "unimportant" terms
Example: matching titles across the two relations
  The Hitchhiker's Guide to the Galaxy, 2005 ~ Hitchhiker's Guide to the Galaxy
  (other titles: Star Wars Episode III, Men in Black, 1997, Space Balls, 1987, Cinderella Man, …)
Years are common in the review archive, so they have low weight.
Inverted index entries:
  hitchhiker → movie00137
  the → movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, …
WHIRL results
• This sort of worked:
– Interactive speeds
(<0.3s/q) with a few
hundred thousand tuples.
– For 2-way joins, average
precision (sort of like area
under precision-recall curve)
from 85% to 100% on 13
problems in 6 domains.
– Average precision better
than 90% on 5-way joins
WHIRL and soft integration
• WHIRL worked for a number of
web-based demo applications.
– e.g., integrating data from 30-50
smallish web DBs with <1 FTE
labor
• WHIRL could link many data
types reasonably well, without
engineering
• WHIRL generated numerous
papers (Sigmod98, KDD98,
Agents99, AAAI99, TOIS2000,
AIJ2000, ICML2000, JAIR2001)
• WHIRL was relational
  – But see ELIXIR (SIGIR2001)
• WHIRL users need to know the schema of source DBs
• WHIRL's query-time linkage worked only for TFIDF, token-based distance metrics
  – → text fields with few misspellings
• WHIRL was memory-based
  – all data must be centrally stored—no federated data.
  – → small datasets only
WHIRL vision: very radical, everything was inter-dependent

SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a~S.a and S.b~T.b
(~ means TFIDF-similar)

Link items as needed by Q. Incrementally produce a ranked list of possible links, with "best matches" first. The user (or a downstream process) decides how much of the list to generate and examine.

Query Q results (ranked):
  R.a      S.a      S.b      T.b
  Anhai    Anhai    Doan     Doan
  Dan      Dan      Weld     Weld
  William  Will     Cohen    Cohn
  Steve    Steven   Minton   Mitton
  William  David    Cohen    Cohn
String Similarity Metrics
• Tf-idf measures are not really very good at handling similarity between "short textual attributes" (e.g. titles)
• String similarity metrics are more suitable
• String similarity can be handled in terms of
  – Edit distance: the number of primitive ops (such as "backspace", "overtype") needed to convert one string into another (see the sketch below)
  – N-gram distance (see next slide)
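A minimal edit-distance sketch, using the standard Levenshtein dynamic program over inserts, deletes, and substitutions; the example strings are illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character inserts,
    deletes, and substitutions needed to turn s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # substitute / match
    return dp[m][n]

print(edit_distance("Stonebreaker", "Stonebraker"))   # 1
```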
N-gram distance
• An n-gram of a string is a contiguous n-character subsequence of the string
  – The 3-grams of the string "hitchhiker" are {hit, itc, tch, chh, hhi, hik, ike, ker}
  – "space" can be treated as a special character
• A string S can be represented as the set of its n-grams
  – Similarity between two strings can be defined in terms of the similarity between the sets
  – Can do Jaccard similarity (see the sketch below)
• N-grams are to strings what k-shingles are to documents
  – Document duplicate detection is often done in terms of the set similarity between their shingles
  – Each shingle is hashed to a hash signature; a Jaccard similarity is computed between the document shingle sets
  – Useful for plagiarism detection (e.g. the Turnitin software does it)
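A minimal sketch of n-gram extraction and Jaccard similarity between the resulting sets; the lower-casing and the absence of space padding are simplifying assumptions.

```python
def ngrams(s, n=3):
    """Set of character n-grams of s (no padding; the slide notes that space
    can be treated as a special character if padding is desired)."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

print(sorted(ngrams("hitchhiker")))
# ['chh', 'hhi', 'hik', 'hit', 'ike', 'itc', 'ker', 'tch']
print(round(jaccard(ngrams("hitchhiker"), ngrams("hitchiker")), 2))
# 0.67 -- a misspelling stays close under n-gram similarity
```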
Performance
2. Supporting keyword search on databases
How do we answer a query like "Soumen Sunita"?
Issues:
• the schema is normalized (not everything is in one table)
• how to rank multiple tuples which contain the keywords?
Motivation
• Keyword search of documents on the Web has been enormously successful
  – Simple and intuitive, no need to learn any query language
• Database querying using keywords is desirable
  – SQL is not appropriate for casual users
  – Form interfaces are cumbersome:
    - Require a separate form for each type of query — confusing for casual users of Web information systems
    - Not suitable for ad hoc queries
Motivation
• Many Web documents are dynamically generated from databases
  – E.g. catalog data
• Keyword querying of the generated Web documents
  – May miss answers that need to combine information on different pages
  – Suffers from duplication overheads
Examples of Keyword Queries
• On a railway reservation database: "mumbai bangalore"
• On a university database: "database course"
• On an e-store database: "camcorder panasonic"
• On a book store database: "sudarshan databases"
Differences from IR/Web Search
• Related data is split across multiple tuples due to normalization
  – E.g. Paper (paper-id, title, journal), Author (author-id, name), Writes (author-id, paper-id, position), Cites (citing-paper-id, cited-paper-id)
• Different keywords may match tuples from different relations
  – What joins are to be computed can only be decided on the fly
Connectivity
• Tuples may be connected by
  – Foreign key and object references
  – Inclusion dependencies and join conditions
  – Implicit links (shared words), etc.
• Would like to find sets of (closely) connected tuples that match all given keywords
Basic Model
• Database: modeled as a graph
  – Nodes = tuples
  – Edges = references between tuples (foreign key, inclusion dependencies, ...)
  – Edges are directed.
BANKS: Keyword search…
Example graph: the paper "MultiQuery Optimization" is connected through writes edges to the author tuples Charuta, S. Sudarshan, and Prasan Roy.
Answer Example
Query: sudarshan roy
Answer tree: the paper "MultiQuery Optimization", connected through writes tuples to the authors S. Sudarshan and Prasan Roy.
Combining Keyword Search and Browsing
• Catalog searching applications
  – Keywords may restrict answers to a small set, then the user needs to browse the answers
  – If there are multiple answers, hierarchical browsing is required on the answers
What BANKS Does
• The whole DB is seen as a directed graph (edges correspond to foreign keys)
• Answers are subgraphs
• Ranked by edge weights

In BANKS, each potential solution is a rooted
weighted tree where

Nodes are tuples from tables

Node weight can be defined in terms of “pagerank” style
notions (e.g. back-degree)


Edges are foreign-primary key references between tuples
across tables

Links are given domain specific weights



They use log(1+x) where x is the number of back links
Paperwrites is seen as stronger than Papercites table
Tuples in the tree must cover keywords
Relevance of a tree is based on its weight

Weight of the tree is a combination of its node and link
weights
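The sketch below scores one BANKS-style answer tree. The slide mentions log(1+x) over back-link counts and type-dependent link strengths; exactly where BANKS applies each and how it combines node and edge weights are simplified assumptions here, as are the toy counts and the EDGE_WEIGHT values.

```python
import math

# Toy answer tree for the query {sudarshan, roy}: nodes are tuples, edges are
# foreign-key references. Back-link counts, edge-type weights, and the additive
# combination below are assumptions for illustration.
BACK_LINKS = {                          # assumed in-degree of each tuple node
    "paper:MultiQuery Optimization": 2,
    "author:S. Sudarshan": 5,
    "author:Prasan Roy": 3,
}
TREE_EDGES = [                          # (from, to, link type)
    ("author:S. Sudarshan", "paper:MultiQuery Optimization", "writes"),
    ("author:Prasan Roy", "paper:MultiQuery Optimization", "writes"),
]
EDGE_WEIGHT = {"writes": 1.0, "cites": 0.5}   # writes assumed stronger than cites

def node_weight(back_links):
    return math.log(1 + back_links)     # the slide's log(1+x) prestige notion

def tree_score(nodes, edges, keywords):
    text = " ".join(nodes).lower()
    if not all(k.lower() in text for k in keywords):
        return 0.0                      # an answer tree must cover all keywords
    return (sum(node_weight(BACK_LINKS[n]) for n in nodes)
            + sum(EDGE_WEIGHT[t] for _, _, t in edges))

print(round(tree_score(list(BACK_LINKS), TREE_EDGES, ["sudarshan", "roy"]), 2))
```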
BANKS: Keyword Search in DB

11/23: Imprecise Queries; Collection Selection

Part III: Answering Imprecise Queries [ICDE 2006; WebDB 2004; WWW 2004]
Why Imprecise Queries?
A feasible query: Make = "Toyota", Model = "Camry", Price ≤ $7000
But what the user really wants is a 'sedan' priced around $7000.
• What about the price of a Honda Accord?
• Is there a Camry for $7100?
Solution: Support Imprecise Queries
Sample answers (Make, Model, Price, Year):
  Toyota, Camry, $7000, 1999
  Toyota, Camry, $7000, 2001
  Toyota, Camry, $6700, 2000
  Toyota, Camry, $6500, 1998
  ………
Dichotomy in Query Processing
Databases:
• User knows what she wants
• User query completely expresses the need
• Answers exactly match query constraints
IR Systems:
• User has an idea of what she wants
• User query captures the need to some degree
• Answers ranked by degree of relevance
Imprecise queries on databases cross the divide.
Existing Approaches
• Similarity search over vector space: data must be stored as vectors of text (WHIRL, W. Cohen, 1998)
• Enhanced database model: add a 'similar-to' operator to SQL, with distances provided by an expert/system designer (VAGUE, A. Motro, 1998); support similarity search and query refinement over abstract data types (Binderberger et al., 2003)
• User guidance: users provide information about the objects required and their possible neighborhood (Proximity Search, Goldman et al., 1998)
Limitations:
1. User/expert must provide similarity measures
2. New operators to use distance measures
3. Not applicable over autonomous databases
Our Objectives:
1. Minimal user input
2. Database internals not affected
3. Domain-independent & applicable to Web databases
Imprecise queries vs. Empty queries
• The "empty query" problem arises when the user's query, when submitted to the database, leads to an empty set of answers.
  – We want methods that can automatically, minimally relax such an empty query and resubmit it so the user gets some results.
  – Existing approaches for the empty query problem are mostly syntactic and rely on relaxing various query constraints; little attention is paid to the best order in which to relax the constraints.
• The imprecise query problem is a general case of the empty query problem
  – We may have a non-empty set of answers to the base query
  – We are interested not just in returning some tuples but in returning them in the order of relevance
General ideas for supporting imprecise queries
Main issues are
1. How to rewrite the base query such that more relevant tuples can be retrieved.
2. How to rank the retrieved tuples in the order of relevance.
A spectrum of approaches is possible, including
1. Data-dependent approaches
2. User-dependent approaches
3. Collaborative approaches
We will look at an approach which is basically data-dependent.
AFDs based Query Relaxation
Pipeline:
1. Imprecise query Q → Map: convert "like" to "=", giving Qpr = Map(Q)
2. Derive the base set Abs = Qpr(R)
3. Use the base set as a set of relaxable selection queries
4. Using AFDs, find the relaxation order
5. Derive the extended set by executing the relaxed queries
6. Use concept similarity to measure tuple similarities
7. Prune tuples below a threshold
8. Return the ranked set
An Example
Relation: CarDB(Make, Model, Price, Year)
Imprecise query Q :− CarDB(Model like "Camry", Price like "10k")
Base query Qpr :− CarDB(Model = "Camry", Price = "10k")
Base set Abs:
  Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
  Make = "Toyota", Model = "Camry", Price = "10k", Year = "2001"
Obtaining Extended Set
Problem: Given the base set, find tuples from the database similar to the tuples in the base set.
Solution:
• Consider each tuple in the base set as a selection query,
  e.g. Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
• Relax each such query to obtain "similar" precise queries,
  e.g. Make = "Toyota", Model = "Camry", Price = "", Year = "2000"
• Execute and determine tuples having similarity above some threshold.
Challenge: Which attribute should be relaxed first? Make? Model? Price? Year?
Solution: Relax the least important attribute first.
Least Important Attribute
Definition: An attribute whose binding value, when changed, has minimal effect on the values binding other attributes.
• Does not decide the values of other attributes
• Its value may depend on other attributes
E.g. changing/relaxing Price will usually not affect other attributes, but changing Model usually affects Price.
Dependence between attributes is useful to decide relative importance
• Approximate Functional Dependencies & Approximate Keys
  – Approximate in the sense that they are obeyed by a large percentage (but not all) of the tuples in the database
• Can use TANE, an algorithm by Huhtala et al. [1999]
Attribute Ordering
Given a relation R:
• Determine the AFDs and approximate keys
• Pick the key with highest support, say Kbest
• Partition the attributes of R into
  – key attributes, i.e. belonging to Kbest
  – non-key attributes, i.e. not belonging to Kbest
• Sort the subsets using influence weights:
  InfluenceWeight(Ai) = Σj (1 − error(A′ → Aj)) / |A′|
  where Ai ∈ A′ ⊆ R, j ≠ i and j = 1 to |Attributes(R)|
• Attribute relaxation order is all non-keys first, then keys
• Multi-attribute relaxation assumes independence
Example: CarDB(Make, Model, Year, Price)
  Key attributes: Make, Year; non-key: Model, Price
  Order: Price, Model, Year, Make
  1-attribute relaxations: {Price, Model, Year, Make}
  2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}
A sketch of this ordering step appears below.
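A minimal sketch of the ordering step, assuming the AFD error values below were already mined (e.g. with TANE); the numbers are invented so that the resulting order matches the slide's Price, Model, Year, Make example, and the formula follows the slide's definition.

```python
# Relaxation-order sketch following the slide's formula:
#   InfluenceWeight(Ai) = sum_j (1 - error(A' -> Aj)) / |A'|,  with Ai in A', j != i
# The AFD errors are invented for CarDB(Make, Model, Year, Price); a real system
# would mine them (and the approximate key) with an algorithm like TANE.
ATTRIBUTES = ["Make", "Model", "Year", "Price"]
KEY_ATTRS = {"Make", "Year"}          # the approximate key Kbest from the slide

# (determining set A', determined attribute Aj, error of the AFD A' -> Aj)
AFDS = [
    (("Model",), "Make", 0.05),       # Model almost determines Make
    (("Model",), "Price", 0.30),
    (("Make",), "Model", 0.60),
    (("Make", "Year"), "Price", 0.45),
    (("Price",), "Model", 0.80),      # Price says little about Model
]

def influence_weight(attr):
    w = 0.0
    for lhs, rhs, error in AFDS:
        if attr in lhs and rhs != attr:
            w += (1.0 - error) / len(lhs)
    return w

def relaxation_order(attrs):
    # Relax all non-key attributes before key attributes, least influential first.
    non_keys = sorted((a for a in attrs if a not in KEY_ATTRS), key=influence_weight)
    keys = sorted((a for a in attrs if a in KEY_ATTRS), key=influence_weight)
    return non_keys + keys

print({a: round(influence_weight(a), 3) for a in ATTRIBUTES})
print("relax in this order:", relaxation_order(ATTRIBUTES))
# With these made-up errors the order comes out Price, Model, Year, Make.
```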
Tuple Similarity
Tuples obtained after relaxation are ranked according to their similarity to the corresponding tuples in the base set:
  Similarity(t1, t2) = Σi AttrSimilarity(value(t1[Ai]), value(t2[Ai])) × Wi
  where the Wi are normalized influence weights, ΣWi = 1, i = 1 to |Attributes(R)|
Value similarity:
• Euclidean for numerical attributes, e.g. Price, Year
• Concept similarity for categorical attributes, e.g. Make, Model
(A sketch of this scoring follows.)
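The sketch below uses an assumed numeric-closeness function for Price/Year and a stubbed concept-similarity table for categorical values; the weights and similarity values are illustrative, not learned.

```python
# Weighted-sum tuple similarity: Euclidean-style closeness for numeric
# attributes, concept similarity (stubbed here) for categorical ones.
# The weights and the 0.6 concept-similarity entry are illustrative assumptions.
WEIGHTS = {"Make": 0.1, "Model": 0.3, "Price": 0.4, "Year": 0.2}   # sums to 1

CONCEPT_SIM = {frozenset(("Camry", "Accord")): 0.6}   # would come from supertuples

def attr_similarity(attr, v1, v2):
    if attr in ("Price", "Year"):                     # numeric: scaled closeness
        v1, v2 = float(v1), float(v2)
        return 1.0 - abs(v1 - v2) / max(abs(v1), abs(v2), 1.0)
    if v1 == v2:
        return 1.0
    return CONCEPT_SIM.get(frozenset((v1, v2)), 0.0)

def tuple_similarity(t1, t2):
    return sum(w * attr_similarity(a, t1[a], t2[a]) for a, w in WEIGHTS.items())

base    = {"Make": "Toyota", "Model": "Camry",  "Price": 10000, "Year": 2000}
relaxed = {"Make": "Honda",  "Model": "Accord", "Price": 9500,  "Year": 2001}
print(round(tuple_similarity(base, relaxed), 3))
```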
Concept (Value) Similarity
Concept: any distinct attribute-value pair, e.g. Make=Toyota
• Visualized as a selection query binding a single attribute
• Represented as a supertuple
Supertuple for the concept Make=Toyota:
  Model: Camry: 3, Corolla: 4, …
  Year: 2000: 6, 1999: 5, 2001: 2, …
  Price: 5995: 4, 6500: 3, 4000: 6
Concept similarity: estimated as the percentage of correlated values common to two given concepts
  Similarity(v1, v2) = Σi Commonality(Correlated(v1, values(Ai)), Correlated(v2, values(Ai)))
  where v1, v2 ∈ Aj, i ≠ j and Ai, Aj ∈ R
• Measured as the Jaccard similarity among the supertuples representing the concepts:
  JaccardSim(A, B) = |A ∩ B| / |A ∪ B|
(A sketch of this estimate follows.)
Concept (Value) Similarity Graph
[Graph over Make values (Dodge, Nissan, Honda, BMW, Ford, Chevrolet, Toyota) with edges labeled by concept similarity, e.g. values such as 0.11, 0.12, 0.15, 0.16, 0.22, 0.25]
Empirical Evaluation
Goal:
• Evaluate the effectiveness of the query relaxation and concept learning
Setup:
• A database of used cars: CarDB(Make, Model, Year, Price, Mileage, Location, Color)
• Populated using 30k tuples from Yahoo Autos
• Concept similarity estimated for Make, Model, Location, Color
• Two query relaxation algorithms:
  – RandomRelax – randomly picks an attribute to relax
  – GuidedRelax – uses the relaxation order determined using approximate keys and AFDs
Evaluating the effectiveness of relaxation
Test Scenario:
• 10 randomly selected base queries from CarDB
• 20 tuples showing similarity > Є, with 0.5 < Є < 1
  – Similarity = weighted summation of attribute similarities
  – Euclidean distance used for Year, Price, Mileage
  – Concept similarity used for Make, Model, Location, Color
• Limit of 64 relaxed queries per base query (128 max possible with 7 attributes)
• Efficiency measured using the metric
  Work/RelevantTuple = |ExtractedTuples| / |RelevantExtracted|
Efficiency of Relaxation
[Charts: Work/Relevant Tuple (y-axis) per query (queries 1–10, x-axis) for Є = 0.5, 0.6, 0.7; left panel: Random Relaxation, right panel: Guided Relaxation]
Random Relaxation:
• Average 8 tuples extracted per relevant tuple for Є = 0.5; increases to 120 tuples for Є = 0.7.
• Not resilient to change in Є
Guided Relaxation:
• Average 4 tuples extracted per relevant tuple for Є = 0.5; goes up to 12 tuples for Є = 0.7.
• Resilient to change in Є
Summary
An approach for answering imprecise queries over Web databases:
• Mine and use AFDs to determine attribute importance
• Domain-independent concept similarity estimation technique
• Tuple similarity score as a weighted sum of attribute similarity scores
Empirical evaluation shows:
• Reasonable concept similarity models estimated
• Set of similar precise queries efficiently identified
Collection Selection/Meta Search
Introduction
• Metasearch Engine
  – A system that provides unified access to multiple existing search engines.
• Metasearch Engine Components
  – Database Selector: identifying potentially useful databases for each user query
  – Document Selector: identifying potentially useful documents returned from the selected databases
  – Query Dispatcher and Result Merger: ranking the selected documents
Collection Selection
[Diagram: query → Collection Selection → Query Execution over the chosen collections (WSJ, WP, FT, CNN, NYT) → Results Merging → results]
Evaluating collection selection
• Let c1..cj be the collections that are chosen to be accessed for the query Q. Let d1…dk be the top documents returned from these collections.
• We compare these results to the results that would have been returned from a central union database
  – Ground truth: the ranking of documents that the retrieval technique (say vector space or Jaccard similarity) would have retrieved from a central union database that is the union of all the collections
• Compare the precision of the documents returned by accessing only the selected collections against this ground truth.
General Scheme & Challenges
• Get a representative of each database
  – The representative is a sample of files from the database
  – Challenge: get an unbiased sample when you can only access the database through queries.
• Compare the query to the representatives to judge the relevance of a database
  – Coarse approach: convert the representative files into a single file (super-document). Take the (vector) similarity between the query and the super-document of a database to judge that database's relevance.
  – Finer approach: keep the representative as a mini-database. Union the mini-databases to get a central mini-database. Apply the query to the central mini-database and find the top-k answers from it. Decide the relevance of each database based on which of the answers came from which database's representative.
• You can use an estimate of the size of the database too
  – What about overlap between collections? (See the ROSCO paper)
Uniform Probing for Content Summary Construction
• Automatic extraction of document frequency statistics from uncooperative databases
  – [Callan and Connell TOIS 2001], [Callan et al. SIGMOD 1999]
• Main Ideas
  – Pick a word and send it as a query to database D
    - RandomSampling-OtherResource (RS-Ord): from a dictionary
    - RandomSampling-LearnedResource (RS-Lrd): from retrieved documents
  – Retrieve the top-k documents returned
  – If the number of retrieved documents exceeds a threshold T, stop; otherwise restart at the beginning
  – k = 4, T = 300
  – Compute the sample document frequency for each word that appeared in a retrieved document.
CORI Net Approach (Representative as a super document)
• Representative Statistics
  – The document frequency for each term and each database
  – The database frequency for each term
• Main Ideas
  – Visualize the representative of a database as a super document, and the set of all representatives as a database of super documents
  – Document frequency becomes term frequency in the super document, and database frequency becomes document frequency in the super database
  – Ranking scores can be computed using a similarity function such as the cosine function (a minimal sketch follows)
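A minimal sketch of the super-document idea: concatenate each collection's sampled documents, then rank collections by cosine similarity to the query. Plain term-frequency vectors are used instead of CORI's actual scoring formula, and the collection names and sample texts are invented.

```python
import math
from collections import Counter

# Super-document sketch: each collection's sampled documents are concatenated
# into one "super document"; collections are ranked by cosine similarity
# between the query and that super document.
samples = {
    "WSJ": ["bank mergers slow", "fed raises rates", "market rally"],
    "CNN": ["election coverage", "storm warning issued", "bank holiday travel"],
    "NYT": ["merger of two banks approved", "banking regulation debate"],
}

def vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    if not dot:
        return 0.0
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

query = vector("bank mergers")
super_docs = {c: vector(" ".join(docs)) for c, docs in samples.items()}
ranking = sorted(super_docs, key=lambda c: cosine(query, super_docs[c]), reverse=True)
print(ranking)
```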
ReDDE Approach (Representative as a mini-collection)
• Use the representatives as mini-collections
• Construct a union representative that is the union of the mini-collections (such that each document keeps information on which collection it was sampled from)
• Send the query first to the union collection and get the top-k ranked results
  – See which of the results in the top-k came from which mini-collection. The collections are ranked in terms of how much their mini-collections contributed to the top-k answers of the query.
  – Scale the number of returned results by the expected size of the actual collection (a sketch follows)
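A minimal ReDDE-flavoured sketch: score the sampled documents in the union mini-collection against the query, see which collections the top-k hits came from, and scale by estimated collection size. The relevance function, sizes, and scaling below are simplified assumptions, not ReDDE's exact formula.

```python
from collections import Counter

union_sample = [  # (collection the sample came from, sampled document text)
    ("WSJ", "bank mergers slow in europe"),
    ("WSJ", "fed raises interest rates"),
    ("NYT", "regulators approve bank mergers"),
    ("CNN", "storm warning issued for coast"),
]
estimated_size = {"WSJ": 200_000, "NYT": 150_000, "CNN": 300_000}
sample_size    = {"WSJ": 2, "NYT": 1, "CNN": 1}

def score(query, doc):               # crude relevance: query-term overlap
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def rank_collections(query, k=2):
    top = sorted(union_sample, key=lambda cd: score(query, cd[1]), reverse=True)[:k]
    hits = Counter(coll for coll, _ in top)
    # each sampled hit "stands for" size/sample_size documents in the real collection
    return sorted(hits, key=lambda c: hits[c] * estimated_size[c] / sample_size[c],
                  reverse=True)

print(rank_collections("bank mergers"))
```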
We didn’t cover beyond this
Selecting among overlapping collections
(Slides from Thomas Hernandez's MS Thesis Defense, Arizona State University, 10/21/2004)
• Overlap between collections
  – News metasearcher, bibliography search engine, etc.
• Objectives:
  – Retrieve a variety of results
  – Avoid collections with irrelevant or redundant results
[Diagram: query "bank mergers" → Collection Selection over collections (FT, CNN, WSJ, WP, NYT) → Query Execution → Results Merging → ranked results]
Existing work (e.g. CORI) assumes collections are disjoint!
ROSCO approach
Challenge: Defining & Computing Overlap
• Collection overlap may be non-symmetric, or "directional". (A)
• Document overlap may be non-transitive. (B)
[Example A: Collection C1 returns Results A–G while Collection C2 returns Results V–Z, with overlap in one direction only. Example B: Collections C1, C2, and C3 return partially overlapping result lists.]
Gathering Overlap Statistics
• Solution:
  – Consider the query result set of a particular collection as a single bag of words
  – Approximate overlap as the intersection between the result-set bags (see the sketch below)
  – Approximate overlap between 3+ collections using only pairwise overlaps
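A minimal sketch of the pairwise overlap estimate, treating each collection's result set as one bag of words and measuring the multiset intersection; the result snippets are invented.

```python
from collections import Counter

results = {
    "FT":  ["bank merger approved by regulators", "merger talks continue"],
    "CNN": ["regulators approved the bank merger", "markets react to merger"],
}

def result_bag(snippets):
    # one bag of words per collection's result set for the query
    return Counter(w for s in snippets for w in s.lower().split())

def overlap(c1, c2):
    b1, b2 = result_bag(results[c1]), result_bag(results[c2])
    return sum((b1 & b2).values())        # multiset intersection size

print(overlap("FT", "CNN"))
```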
Controlling Statistics
• Objectives:
  – Limit the number of statistics stored
  – Improve the chances of having statistics for new queries
• Solution:
  – Identify frequent item sets among queries (Apriori algorithm)
  – Store statistics only with respect to these frequent item sets
The Online Component
• Purpose: determine the collection order for a user query
  1. Map the query to stored frequent item sets
  2. Compute statistics for the query using the mapped item sets
  3. Determine the collection order
• Offline component: gather coverage and overlap information for past queries; identify frequent item sets among the queries; compute coverage/overlap statistics for the frequent item sets.
• Online component: map the user query to frequent item sets; compute statistics for the query using the mapped item sets; determine the collection order for the query.
Training our System
• Training set: 90% of the query list
• Gathering statistics for training queries:
  – Probing of the 15 collections
• Identifying frequent item sets:
  – Support threshold used: 0.05% (i.e. 9 queries)
  – 681 frequent item sets found
• Computing statistics for item sets:
  – Statistics fit in a 1.28MB file
  – Sample entry:
    network,neural 22 MIX15 0.11855 CI,SC 747 AG 0.07742 AD 0.01893 SC,MIX15 801.13636 …
Performance Evaluation
• Measuring the number of new and duplicate results:
  – Duplicate result: has cosine similarity > 0.95 with at least one retrieved result
  – New result: has no duplicate
• Oracular approach:
  – Knows which collection has the most new results
  – Retrieves a large portion of new results early
Evaluation of Collection Selection