Classifying and Searching “Hidden-Web” Text Databases
Panos Ipeirotis
Computer Science Department, Columbia University
Motivation: “Surface” Web vs. “Hidden” Web

“Surface” Web:
 Link structure
 Crawlable
 Documents indexed by search engines

“Hidden” Web:
 No link structure
 Documents “hidden” in databases
 Documents not indexed by search engines
 Need to query each collection individually

[Figure: a search form with a keyword box and SUBMIT/CLEAR buttons, illustrating query-only access]
Hidden-Web Databases: Examples

Search on the U.S. Patent and Trademark Office (USPTO) database:
  [wireless network] → 29,051 matches
  (USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html)

Search on Google restricted to the USPTO database site:
  [wireless network site:patft.uspto.gov] → 0 matches

  Database             Query             Database Matches   Site-Restricted Google Matches
  USPTO                wireless network  29,051             0
  Library of Congress  visa regulations  >10,000            0
  PubMed               thrombopenia      26,887             221
  (as of July 6th, 2004)
Interacting With Hidden-Web Databases

 Browsing: Yahoo!-like directories, populated manually
   InvisibleWeb.com
   SearchEngineGuide.com
 Searching: Metasearchers

[Figure: a topic hierarchy (Root → Arts, Computers, Legal, Science, Sports; under Legal: Patents → USPTO) used to organize Hidden-Web databases]
Outline of Talk

 Classification of Hidden-Web Databases
 Search over Hidden-Web Databases
 Modeling and Managing Changes in Hidden-Web Databases
Hierarchically Classifying the ACM Digital Library

[Figure: the ACM Digital Library being classified into a topic hierarchy. Each node (Arts, Computers, Health, Science, Sports; under Computers: Software, Hardware; under Software: Programming; under Programming: C/C++, Perl, Java, Visual Basic) is marked with a “?”, the decision of whether the database belongs there.]
Text Database Classification: Definition

 For a text database D and a category C:
   Coverage(D,C) = number of docs in D about C
   Specificity(D,C) = fraction of docs in D about C
 Assign a text database to a category C if:
   Database coverage for C is at least Tc
    (Tc: coverage threshold, e.g., > 100 docs in C)
   Database specificity for C is at least Ts
    (Ts: specificity threshold, e.g., > 40% of docs in C)
   (a small code sketch of this rule follows)
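As a minimal sketch (in Python; the function and the example counts are illustrative, not from the talk), the assignment rule is:

    def assign_categories(coverage, specificity, t_c=100, t_s=0.40):
        """Assign database D to every category C whose coverage and
        specificity both clear the thresholds Tc and Ts."""
        return [c for c in coverage
                if coverage[c] >= t_c and specificity[c] >= t_s]

    # Hypothetical database with 5,350 documents about three topics:
    coverage = {"Computers": 4800, "Sports": 500, "Health": 50}
    specificity = {c: n / 5350 for c, n in coverage.items()}
    print(assign_categories(coverage, specificity))  # ['Computers']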
Brute-Force Classification “Strategy”

1. Extract all documents from the database
2. Classify documents on topic
   (use state-of-the-art classifiers: SVMs, C4.5, RIPPER, …)
3. Classify database according to topic distribution

Problem: No direct access to the full contents of Hidden-Web databases
Classification: Goal & Challenges

 Goal: Discover the database topic distribution
 Challenges:
   No direct access to full contents of Hidden-Web databases
   Only limited search interfaces available
   Should not overload databases

Key observation: Only queries “about” the database topic(s) generate a large number of matches.
Query-based Database Classification: Overview

1. Train document classifier
2. Extract queries from classifier
3. Adaptively issue queries to database
4. Identify topic distribution based on adjusted number of query matches
5. Classify database

[Pipeline figure: TRAIN CLASSIFIER → EXTRACT QUERIES (e.g., Sports: +nba +knicks; Health: +sars) → QUERY DATABASE (+sars → 1254 matches) → IDENTIFY TOPIC DISTRIBUTION → CLASSIFY DATABASE into the hierarchy (Root → Arts, Computers, Legal, Science, Sports)]
Training a Document Classifier

1. Get training set (set of pre-classified documents)
2. Select best features to characterize documents
   (Zipf’s law + information-theoretic feature selection) [Koller and Sahami 1996]
3. Train classifier (SVM, C4.5, RIPPER, …)

Output: a “black-box” model for classifying documents

[Pipeline figure: the TRAIN CLASSIFIER step, producing a document classifier that routes documents into the hierarchy (Root → Arts, Computers, Legal, Science, Sports)]
Extracting Query Probes (ACM TOIS 2003)

Transform the classifier model into queries:
 Trivial for “rule-based” classifiers (RIPPER)
 Easy for decision-tree classifiers (C4.5), for which rule generators exist (C4.5rules)
 Trickier for other classifiers: we devised rule-extraction methods for linear classifiers (linear-kernel SVMs, Naïve Bayes, …)

Example query for Sports: +nba +knicks

[Pipeline figure: the EXTRACT QUERIES step]
Querying Database with Extracted Queries (SIGMOD 2001, ACM TOIS 2003)

 Issue each query to the database to obtain the number of matches, without retrieving any documents
 Increase the coverage of the rule’s category accordingly (e.g., if +nba +knicks matches 706 documents: #Sports = #Sports + 706); this loop is sketched below

[Pipeline figure: the QUERY DATABASE step (+sars → 1254 matches)]
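A sketch of this step, assuming a hypothetical get_match_count(query) helper that submits the query through the database’s search interface and scrapes the reported number of matches:

    # (category, query) probes extracted from the classifier rules.
    probes = [("Sports", "+nba +knicks"), ("Health", "+sars")]

    def probe_database(probes, get_match_count):
        """Estimate category coverage from probe match counts,
        without retrieving any documents."""
        ecoverage = {}
        for category, query in probes:
            matches = get_match_count(query)  # e.g., 706 for "+nba +knicks"
            ecoverage[category] = ecoverage.get(category, 0) + matches
        return ecoverage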
Identifying Topic Distribution from Query Results

Query-based estimates of the topic distribution are not perfect:
 Document classifiers are not perfect:
   Rules for one category match documents from other categories
 Querying is not perfect:
   Queries for the same category might overlap
   Queries do not match all documents in a category

Solution: Learn to adjust the results of the query probes
Confusion Matrix Adjustment of Query Probe Results

Confusion matrix M (rows: assigned class; columns: correct class). For example, 10% of “sports” documents match the queries for “computers”:

                 correct class
  assigned class   comp   sports   health
  comp             0.80   0.10     0.00
  sports           0.08   0.85     0.04
  health           0.02   0.15     0.96

Multiplying M by the correct (but unknown) topic distribution gives the incorrect topic distribution derived from query probing:

  [0.80 0.10 0.00]   [1000]   [ 800 +  500 +  0]   [1300]
  [0.08 0.85 0.04] · [5000] = [  80 + 4250 +  2] = [4332]
  [0.02 0.15 0.96]   [  50]   [  20 +  750 + 48]   [ 818]

This “multiplication” can be inverted to get a better estimate of the real topic distribution from the probe results.
Confusion Matrix Adjustment of Query Probe Results (cont.)

    Coverage(D) ≈ M⁻¹ · ECoverage(D)

where Coverage(D) is the adjusted estimate of the topic distribution and ECoverage(D) holds the probing results.

 M is usually diagonally dominant for “reasonable” document classifiers, hence invertible
 Compensates for errors in query-based estimates of the topic distribution (a numeric check follows)
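A small numeric check of this adjustment with NumPy, using the confusion matrix and probe counts from the previous slide (solving the linear system is equivalent to multiplying by M⁻¹):

    import numpy as np

    # M[i][j] = fraction of class-j documents matching class-i probes
    # (classes: comp, sports, health).
    M = np.array([[0.80, 0.10, 0.00],
                  [0.08, 0.85, 0.04],
                  [0.02, 0.15, 0.96]])
    ecoverage = np.array([1300, 4332, 818])  # probing results

    coverage = np.linalg.solve(M, ecoverage)  # Coverage(D) ≈ M⁻¹ · ECoverage(D)
    print(coverage.round())                   # [1000. 5000.   50.]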
Classification Algorithm (Again)

One-time process:
1. Train document classifier
2. Extract queries from classifier

For every database:
3. Adaptively issue queries to database
4. Identify topic distribution based on adjusted number of query matches
5. Classify database

[Pipeline figure, as before]
Experimental Setup

 72-node, 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes)
 500,000 Usenet articles (April–May 2000):
   Newsgroups (e.g., comp.hardware, rec.music.classical, rec.photo.*) assigned by hand to hierarchy nodes
   RIPPER trained with 54,000 articles (1,000 articles per leaf); 27,000 articles used to construct the confusion matrix
 500 “controlled” databases built using 419,000 newsgroup articles (to run detailed experiments)
 130 real Web databases picked from InvisibleWeb (first 5 under each topic)
Experimental Results: Controlled Databases

 Accuracy (using F-measure):
   Above 80% for most <Tc, Ts> threshold combinations tried
   Degrades gracefully with hierarchy depth
   Confusion-matrix adjustment helps
 Efficiency:
   Relatively small number of probes (<500) needed for most <Tc, Ts> threshold combinations tried
Experimental Results: Web Databases

 Accuracy (using F-measure):
   ~70% for the best <Tc, Ts> combination
   Learned thresholds that reproduce human classification
   Tested threshold choice using 3-fold cross-validation
 Efficiency:
   120 queries per database needed on average for the chosen thresholds, with no documents retrieved
   Only a small part of the hierarchy is “explored”
   Queries are short: 1.5 words on average, 4 words maximum (easily handled by most Web databases)
Other Experiments (ACM TOIS 2003; IEEE Data Engineering Bulletin 2003)

 Effect of the choice of document classifier: RIPPER, C4.5, Naïve Bayes, SVM
 Benefits of feature selection
 Effect of search-interface heterogeneity: Boolean vs. vector-space retrieval models
 Effect of the query-overlap elimination step
 Over crawlable databases: query-based classification is orders of magnitude faster than “brute-force” crawling-based classification
Hidden-Web Database Classification: Summary

 Handles autonomous Hidden-Web databases accurately and efficiently:
   ~70% F-measure
   Only 120 queries issued on average, with no documents retrieved
 Handles a large family of document classifiers
  (and can hence exploit future advances in machine learning)
Outline of Talk

 Classification of Hidden-Web Databases
 Search over Hidden-Web Databases
 Modeling and Managing Changes in Hidden-Web Databases
Interacting With Hidden-Web Databases

 Browsing: Yahoo!-like directories
 Searching: Metasearchers

[Figure: a metasearcher routing a query to Hidden-Web databases (PubMed, USPTO, NYTimes Archives, Library of Congress, …) whose content is not accessible through Google]
Metasearchers Provide Access to Distributed Databases

Database selection relies on simple content summaries: vocabulary and word frequencies.

Example: for the query [thrombopenia], the metasearcher consults summaries such as:

  PubMed (11,868,552 documents)
    aids          123,826
    cancer        1,598,896
    heart         706,537
    hepatitis     124,320
    thrombopenia  26,887

and the per-database counts for the query word:

  PubMed            thrombopenia  26,887
  NYTimes Archives  thrombopenia  0
  USPTO             thrombopenia  42
Extracting Content Summaries from Autonomous Hidden-Web Databases [Callan & Connell 2001]

1. Send random queries to the database
2. Retrieve the top matching documents
3. If 300 documents have been retrieved, stop; else go to Step 1

The content summary contains the words in the sample and the document frequency of each word.

Problems:
 Random sampling retrieves non-representative documents
 Frequencies in the summary are “compressed” to the sample-size range
 Summaries from small samples are highly incomplete
Extracting a Representative Document Sample

Problem 1: Random sampling retrieves non-representative documents

1. Train a document classifier
2. Create queries from the classifier
3. Adaptively issue queries to the database
    Retrieve the top-k matching documents for each query
    Save #matches for each one-word query
4. Identify the topic distribution based on the adjusted number of query matches
5. Categorize the database
6. Generate the content summary from the document sample

Sampling retrieves documents only from “topically dense” areas of the database.
Sample Frequencies vs. Actual Frequencies

Problem 2: Frequencies in the summary are “compressed” to the sample-size range

  PubMed (11,868,552 docs)              PubMed Sample (300 documents)
    cancer  1,562,477      → sampling →   cancer  45
    heart     691,360      → sampling →   heart   16

Key observation: Query matches reveal frequency information.
Adjusting Document Frequencies (VLDB 2002)

 Zipf’s law empirically connects word frequency f and rank r:

    f = A · (r + B)^c

[Figure: frequency vs. rank curve, with sample words (cancer, liver, kidneys, stomach, hepatitis, …) placed along it]
Adjusting Document Frequencies (cont.)

 Zipf’s law empirically connects word frequency f and rank r: f = A · (r + B)^c
 We know the document frequency and the rank r of the words in the sample (e.g., cancer: rank 1; liver: rank 12; kidneys: rank 78)

[Figure: sample frequencies plotted against sample rank]
Adjusting Document Frequencies (cont.)

 Zipf’s law: f = A · (r + B)^c
 We know the document frequency and the rank r of the words in the sample
 We know the real document frequency f of some words from one-word queries (e.g., 140,000, 60,000, and 20,000 matches)

[Figure: the known database frequencies of a few sample words plotted against their sample ranks]
Adjusting Document Frequencies (cont.) (VLDB 2002)

 Zipf’s law: f = A · (r + B)^c
 We know the document frequency and the rank r of the words in the sample
 We know the real document frequency f of some words from one-word queries
 We use curve fitting to estimate the absolute frequency of all words in the sample (sketched below)

[Figure: the fitted Zipf curve extrapolates estimated database frequencies for every sampled word]
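A sketch of the curve-fitting step with SciPy; the anchor points (sample ranks and one-word-query match counts) are illustrative values loosely following the slide:

    import numpy as np
    from scipy.optimize import curve_fit

    def zipf(r, A, B, c):
        """Zipf-Mandelbrot law: f = A * (r + B)^c, with c < 0."""
        return A * (r + B) ** c

    # Words with known database frequency (from one-word queries),
    # paired with their rank in the document sample.
    ranks = np.array([1.0, 12.0, 78.0])
    matches = np.array([140_000.0, 60_000.0, 20_000.0])

    (A, B, c), _ = curve_fit(zipf, ranks, matches,
                             p0=(150_000, 1.0, -0.5),
                             bounds=([0, 0, -5], [np.inf, np.inf, 0]))

    # Estimated absolute database frequency of any sampled word, by rank:
    print(int(zipf(30, A, B, c)))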
Actual PubMed Content Summary

 Extracted automatically: fewer than 200 queries, at most 4 documents retrieved per query
 ~27,500 words in the extracted content summary

  PubMed content summary
    Number of Documents: 8,868,552 (Actual: 12,349,932)
    Category: Health, Diseases
    cancer       1,562,477
    heart          581,506 (Actual: 706,537)
    aids           121,491
    hepatitis       73,481 (Actual: 124,320)
    basketball         907 (Actual: 1,094)
    cpu                598

(heart, hepatitis, and basketball were not among the one-word probes, hence the estimated frequencies)
Sampling and Incomplete Content Summaries

Problem 3: Summaries from small samples are highly incomplete

[Log-log plot of word frequency vs. rank in PubMed, with the 300-document sample marked: e.g., “endocarditis” is within the 10% most frequent words in the PubMed database, appearing in ~9,000 docs (~0.1%), yet is easily missed by the sample]

 Many words appear in “relatively few” documents (Zipf’s law)
 Low-frequency words are often important
 Small document samples miss many low-frequency words
Sample-based Content Summaries

Challenge: Improve content summary quality without increasing sample size

Main idea: Database classification helps
 Similar topics ↔ similar content summaries
 Extracted content summaries complement each other
Databases with Similar Topics

 Cancerlit contains “metastasis”, but the word was not found during sampling
 CancerBACUP contains “metastasis”
 Databases under the same category have similar vocabularies, and can complement each other

  CANCERLIT (Number of Documents: 148,944)
    breast        121,134
    cancer         91,688
    thrombopenia   11,344
    metastasis     <not found>

  CancerBACUP (Number of Documents: 17,328)
    breast         12,546
    cancer          9,735
    thrombopenia   <not found>
    metastasis      3,569
Content Summaries for Categories

 Databases under the same category share similar vocabulary
 Higher-level category content summaries provide additional useful estimates
 Can use all estimates in the category path

[Hierarchy figure: fraction of each sample containing “metastasis” at each node of the path — Root: 0.2%; Health: 5%; PubMed: 4%; Cancer: 9.2%; CANCERLIT: 0%; CancerBACUP: 12%]
Enhancing Summaries Using “Shrinkage” (SIGMOD 2004)

Fraction of sample containing “metastasis” along the category path:

  Category: Root    (|Sample| = 30,000)    0.2% (± 0.01%)
  Category: Health  (|Sample| = 8,000)     5%   (± 0.1%)
  Category: Cancer  (|Sample| = 1,200)     9.2% (± 2%)
  Database: D       (|Sample| = 300 docs)  0%   (± 12%)

 Estimates from database content summaries can be unreliable
 Category content summaries are more reliable (based on larger samples) but less specific to the database
 By combining estimates from category and database content summaries we get better estimates
Shrinkage-based Estimations

Adjust probability estimates using the samples along the category path (Root: 0.002; Health: 0.05; Cancer: 0.092; Database D: 0, which is suspect):

    Pr[metastasis | D] = λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000

Select the λi weights to maximize the probability that the summary of D is from a database under all its parent categories. This avoids the “sparse data” problem and decreases estimation risk (a small sketch of the mixture follows).
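A minimal sketch of the mixture, using the λ weights reported on the next slide for CANCERLIT (the weights themselves come from the likelihood-maximization step described above):

    # Estimates of Pr[metastasis | .] along the category path, top-down:
    path_estimates = [0.002, 0.05, 0.092, 0.0]  # Root, Health, Cancer, CANCERLIT
    lambdas = [0.02, 0.13, 0.20, 0.65]          # λroot, λhealth, λcancer, λcancerlit

    def shrink(path_estimates, lambdas):
        """Shrinkage estimate: λ-weighted mixture of the database's own
        estimate and the estimates of its ancestor categories."""
        return sum(l * p for l, p in zip(lambdas, path_estimates))

    print(round(shrink(path_estimates, lambdas), 4))  # 0.0249, i.e., ~2.5%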
Shrinkage Weights and Summary

Shrinkage-based estimates for CANCERLIT (λroot = 0.02, λhealth = 0.13, λcancer = 0.20, λcancerlit = 0.65):

  word         new estimate   Root    Health   Cancer   old estimate
  metastasis   2.5%           0.2%    5%       9.2%     0%
  aids         14.3%          0.8%    7%       2%       20%
  football     0.17%          2%      1%       0%       0%

Shrinkage:
 Increases estimates for underestimates (e.g., metastasis)
 Decreases word-probability estimates for overestimates (e.g., aids)
 …but it also introduces (with small probabilities) spurious words (e.g., football)
Adaptive Application of Shrinkage

 Database selection algorithms assign scores to databases for each query
 When frequency estimates are uncertain, the assigned score is uncertain…
 …but sometimes confidence about the assigned score is high
 When confident about the score, shrinkage is unnecessary

[Figure: two probability distributions over the database score for a query — a wide one (unreliable score estimate: use shrinkage) and a sharply peaked one (reliable score estimate: shrinkage might hurt)]
Extracting Content Summaries: Problems Solved

Problem 1: Random sampling may retrieve non-representative documents
Solution: Focus querying on “topically dense” areas of the database

Problem 2: Frequencies are “compressed” to the sample-size range
Solution: Exploit the number of matches per query and adjust estimates using curve fitting

Problem 3: Summaries based on small samples are highly incomplete
Solution: Exploit database classification and augment summaries using samples from topically similar databases
Searching Algorithm

One-time process:
1. Classify databases and extract document samples
2. Adjust frequencies in samples

For every query Q, and for each database D (sketched below):
 Assign a score to D (using its extracted content summary)
 Examine the uncertainty of the score
 If the uncertainty is high, apply shrinkage and assign a new score
Then query only the top-K scoring databases.
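A sketch of the per-query loop, under stated assumptions: score(), score_uncertainty(), and shrunk_summary() stand in for the actual scoring function (e.g., CORI), its confidence estimate, and the shrinkage step:

    def select_databases(query, databases, K, score, score_uncertainty,
                         shrunk_summary, uncertainty_threshold=0.5):
        """Adaptive selection: re-score with shrinkage only when the
        score from a database's own summary is unreliable."""
        scored = []
        for db in databases:
            s = score(query, db.summary)
            if score_uncertainty(query, db.summary) > uncertainty_threshold:
                s = score(query, shrunk_summary(db))  # shrinkage-based re-score
            scored.append((s, db))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [db for _, db in scored[:K]]           # query only the top K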
Experimental Setup (SIGMOD 2004)

 Two standard testbeds from TREC (the “Text Retrieval Conference”):
   200 databases
   100 queries with associated human-assigned document relevance judgments
 Two sets of experiments:
   Content summary quality
    Metrics: precision, recall, Spearman correlation coefficient, KL-divergence
   Database selection accuracy
    Metric: fraction of relevant documents for queries in the top-scored databases
Experimental Results

Content summary quality:
 Shrinkage improves the quality of content summaries without increasing sample size
 Frequency estimation gives accurate (within ±30%) estimates of actual frequencies

Database selection accuracy:
 Focused sampling: improves performance by 20%–40%
 Adaptive application of shrinkage: improves performance by up to 50%
 Shrinkage is robust: improved performance consistently across many different configurations
Results: Database Selection

 Metric: R(K) = X / Y
   X = # of relevant documents in the selected K databases
   Y = # of relevant documents in the best K databases
  (a small helper computing R(K) follows)

[Plot: R(k) for k = 1…20, Shrinkage vs. No Shrinkage; CORI, with stemming, TREC4 testbed]
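For concreteness, a tiny helper computing R(K) (the per-database counts are made up):

    def r_at_k(relevant_per_db, selected, k):
        """R(K) = relevant docs in the K selected databases, divided by
        relevant docs in the best possible K databases."""
        x = sum(relevant_per_db[db] for db in selected[:k])
        y = sum(sorted(relevant_per_db.values(), reverse=True)[:k])
        return x / y

    counts = {"db1": 40, "db2": 10, "db3": 25, "db4": 5}
    print(r_at_k(counts, selected=["db3", "db2"], k=2))  # (25+10)/(40+25) ≈ 0.54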
Other Experiments (SIGMOD 2004)

 Additional data set: 315 real Web databases
 Choice of database selection algorithm (CORI, bGlOSS, Language Modeling)
 Effect of stemming
 Effect of stop-word elimination
 Comparison with the VLDB’02 hierarchical database selection algorithm
 Universal vs. adaptive application of shrinkage
Search: Contributions

Developed a strategy to automatically summarize the contents of Hidden-Web text databases:
 Assumes no cooperation from the databases
 Improves content summary quality by exploiting topical similarity and number of matches
 Requires no increase in document sample size

Developed an adaptive database selection strategy that decides whether to apply shrinkage in a database- and query-specific way.
Outline of Talk

 Classification of Hidden-Web Databases
 Search over Hidden-Web Databases
 Modeling and Managing Changes in Hidden-Web Databases
Do Content Summaries Change Over Time?

Databases are not static; their content changes. Should we refresh the content summary?

 Examined summaries of 152 real Web databases over 52 weeks
 Summary quality declines as age increases
Updating Content Summaries

 Summaries change → need to refresh them to capture changes
 To devise an update policy → need to know the frequency of “change”:
   A summary changes at time T if dist(current, old(T)) > τ (τ: change-sensitivity threshold)
   Survival analysis estimates the probability S(t) that T > t
   Common model: S(t) = e^(−λt), where λ defines the frequency of change (illustrated below)

Problems:
 No access to the content summaries
 Even if we know the summaries, it takes a long time to estimate λ
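A small illustration of this model (the λ value is the Tom’s Hardware estimate from a later slide; weeks are the time unit used in the study):

    import math

    def survival(t_weeks, lam):
        """S(t) = e^(-λt): probability the summary has NOT changed
        (beyond threshold τ) within t weeks."""
        return math.exp(-lam * t_weeks)

    lam = 0.088                          # estimated changes per week
    print(round(survival(5, lam), 2))    # 0.64: ~36% chance of change in 5 weeks
    print(round(1 / lam, 1))             # 11.4: expected weeks until change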
Cox Proportional Hazards Regression

We want to estimate the frequency of change for each database.

 The Cox PH model examines the effect of database characteristics on the frequency of change
   E.g., “if you double the size of a database, it changes twice as fast”
 The Cox PH model effectively uses “censored” data (i.e., cases where the database did not change within time T)
Cox PH Regression Results

 Examined the effect of:
   Change-sensitivity threshold τ
   Topic
   Domain
   Size
   Number of words
   Differences of summaries extracted in consecutive weeks
 Devised a concrete “change model” according to database characteristics (formula in thesis)
Scheduling Updates

Using our change model, we schedule updates according to the available resources (using the Lagrange-multiplier method):

  D                λ       avg time between updates: 10 weeks   40 weeks
  Tom’s Hardware   0.088   5 weeks                              46 weeks
  USPS             0.023   12 weeks                             34 weeks
Scheduling Results

With clever scheduling we improve the quality of summaries according to a variety of metrics (precision, recall, KL-divergence).
Updating Content Summaries: Contributions

 Extensive experimental study showing that the quality of summaries deteriorates with increasing summary age
 Change-frequency model that uses database characteristics to predict the frequency of change
 Scheduling algorithms that set the update frequency for each database according to the available resources
Overall Contributions

Support for browsing, searching, and updating autonomous Hidden-Web databases.

Browsing:
 Algorithm for automatic classification of Hidden-Web databases
   Accuracy ~70% (F-measure)
   Only 120 queries issued on average, with no documents retrieved

Searching:
 Content summary construction technique that samples “topically dense” areas of the database
 Database selection algorithms (hierarchical and shrinkage-based) that improve on existing algorithms by exploiting topical similarity

Updating:
 Change model that uses database characteristics to predict the frequency of change
 Scheduling algorithms that exploit the model and set the update frequency for each database according to the available resources
Thank you!

Classification and content summary extraction implemented and available for download at: http://sdarts.cs.columbia.edu
Panos Ipeirotis, http://www.cs.columbia.edu/~pirot

Classification and Search of Hidden-Web Databases:
 P. Ipeirotis, L. Gravano. When one Sample is not Enough: Improving Text Database Selection using Shrinkage. [SIGMOD 2004]
 L. Gravano, P. Ipeirotis, M. Sahami. QProber: A System for Automatic Classification of Hidden-Web Databases. [ACM TOIS 2003]
 E. Agichtein, P. Ipeirotis, L. Gravano. Modelling Query-Based Access to Text Databases. [WebDB 2003]
 P. Ipeirotis, L. Gravano. Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection. [VLDB 2002]
 P. Ipeirotis, L. Gravano, M. Sahami. Query- vs. Crawling-based Classification of Searchable Web Databases. [DEB 2002]
 P. Ipeirotis, L. Gravano, M. Sahami. Probe, Count, and Classify: Categorizing Hidden-Web Databases. [SIGMOD 2001]

Approximate Text Matching:
 L. Gravano, P. Ipeirotis, N. Koudas, D. Srivastava. Text Joins in an RDBMS for Web Data Integration. [WWW 2003]
 L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava. Approximate String Joins in a Database (Almost) for Free. [VLDB 2001]
 L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava, L. Pietarinen. Using q-grams in a DBMS for Approximate String Processing. [DEB 2001]

SDARTS: Protocol & Toolkit for Metasearching:
 N. Green, P. Ipeirotis, L. Gravano. SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching. [JCDL 2001]
 P. Ipeirotis, T. Barry, L. Gravano. Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with the Open Archives Initiative. [JCDL 2002]
Future Work: Integrated Access to Hidden-Web Databases

Query: [good indie movies playing in Manhattan tomorrow]

Current top Google result (Feb 17th, 2004): a story at the “Seattle Times” about 9-year-old drummer Rachel Trachtenburg.
Future Work: Integrated Access to Hidden-Web Databases (cont.)

Query: [good indie movies playing in New York now]
→ query review databases, movie databases, and ticket databases

All the information is already available on the Web:
 Review databases: Rotten Tomatoes, NY Times, TONY, …
 Movie databases: All Movie Guide, IMDB
 Tickets: Moviefone, Fandango, …
Future Work: Integrated Access to Hidden-Web Databases (cont.)

Query: [good indie movies playing in New York now]
→ query review databases, movie databases, and ticket databases

 Challenges:
   Short term:
    Learn to interface with different databases
    Adapt database selection algorithms
   Long term:
    Understand the semantics of the query
    Extract “query plans” and optimize for distributed execution
    Personalization
    Security and privacy
SDARTS: Protocol and Toolkit for Metasearching

[Figure: SDARTS routing a query to Web sources (Harrison’s Online, British Medical Journal, PubMed) and to local collections (unstructured text documents; the DLI2 Corpus of XML documents)]
SDARTS: Protocol and Toolkit for Metasearching (ACM+IEEE JCDL Conference 2001, 2002)

Accomplishments:
 Combines the strengths of existing Digital Library protocols (SDLIP, STARTS)
 Enables indexing and wrapping of “local” collections of text and XML documents
 Enables “declarative” wrapping of Hidden-Web databases, with no programming
 Extracts the content summary, topical focus, and technical level of each database
 Interfaces with the Open Archives Initiative, an emerging Digital Library interoperability protocol
 Critical building block for the search component of Columbia’s PERSIVAL project (5-year, $5M NSF Digital Libraries – Phase 2 project)
 Open source, available at http://sdarts.cs.columbia.edu: ~1,000 downloads since Jan 2003
 Supervised and coordinated eight students during development
Current Work: Updating Content Summaries

Databases are not static; their content changes. When should we refresh the content summary?

 Examined 150 real Web databases over 52 weeks
 Modeled changes using “survival analysis” techniques (Cox proportional hazards model)
 Currently developing updating algorithms:
   Contact a database only when necessary
   Improve the quality of summaries by exploiting history

Joint work with Junghoo Cho and Alex Ntoulas (UCLA)
Other Work: Approximate Text Matching (VLDB’01, WWW’03)

Matching similar strings within a relational DBMS is important: the data resides there.

  Service A              Service B
  Jenny Stamatopoulou    Stamatopulou, Jenny
  Panos Ipirotis         Panos Ipeirotis
  John Paul McDougal     John P. McDougal
  Jonh Smith             John Smith
  Aldridge Rodriguez     Al Dridge Rodriguez

Exact joins are not enough: typing mistakes, abbreviations, different conventions.

Introduced algorithms for mapping approximate text joins into SQL:
1. No need for import/export of data
2. Provides a crucial building block for data-cleaning applications
3. Identifies many interesting matches

Joint work with Divesh Srivastava, Nick Koudas (AT&T Labs-Research) and others
No Good Category for a Database

 General “problem” with supervised learning (example: English vs. Chinese databases)
 Devised a technique to analyze whether we can work with a given database:
  1. Find candidate text fields
  2. Send top-level queries
  3. Examine results & construct a similarity matrix
  4. If the “matrix rank” is small → many “similar” pages returned, which indicates that:
    The Web form is not a search interface, or
    The text field is not a “keyword” field, or
    The database is in a different language, or
    The database is on an “unknown” topic
Database Not Category-Focused

 Extract one content summary per topic:
   Focused queries retrieve documents about a known topic
   Each database is represented multiple times in the hierarchy
Near Future Work: Definition and Analysis of Query-based Algorithms (WebDB 2003)

 Currently, query-based algorithms are evaluated only empirically
 It is possible to model the querying process using random graph theory and:
   Analyze thoroughly the properties of the algorithms
   Understand better why, when, and how the algorithms work
 Interested in exploring similar directions:
   Adapt hyperlink-based ranking algorithms
   Use results in graph theory to design sampling algorithms
Database Selection (CORI, TREC6)

[Plot: R(k) vs. k (databases selected), k = 1…20, comparing QBS-Shrinkage and QBS-Plain]

More results in the thesis: Stemming/No Stemming, CORI/LM/bGlOSS, QBS/FPS/RS/CMPL, Stopwords
3-Fold Cross-Validation

[Plots: F-measure vs. specificity threshold (0–1) and F-measure vs. coverage threshold (1–262,144), for the three folds F-1, F-2, F-3]
Crawling- vs. Query-based Classification for CNN Sports (IEEE DEB – March 2002)

Efficiency statistics:

              Crawling-based   Query-based
  Time        1325 min         2 min (−99.8%)
  Downloads   270,202 files    112 queries
  Size        8 GB             357 KB (−99.9%)

Accuracy statistics: crawling-based classification classifies CNN-Sports correctly only after downloading 70% of its documents.
Experiments: Precision of Database Selection Algorithms (VLDB 2002, extended version)

  Content Summary Generation Technique   CORI Hierarchical   CORI Flat
  FP-SVM-Documents                       0.270               0.170
  FP-SVM-Snippets                        0.200               0.183
  Random Sampling                        —                   0.177
  QPilot (backlinks + front page)        —                   0.050
F-measure vs. Hierarchy Depth (ACM TOIS 2003)

[Plot: F1-measure vs. hierarchy depth (0–3) for QP-RIPPER and QP-SVM]
Real Confusion Matrix for Top Node of Hierarchy

               Health   Sports   Science   Computers   Arts
  Health       0.753    0.018    0.124     0.021       0.017
  Sports       0.006    0.171    0.021     0.016       0.064
  Science      0.016    0.024    0.255     0.047       0.018
  Computers    0.004    0.042    0.080     0.610       0.031
  Arts         0.004    0.024    0.027     0.031       0.298
Overlap Elimination

[Plot: F1-measure vs. specificity threshold Ts (0–1) for QP-RIPPER with and without overlap elimination]
No Support for Conjunctive Queries (Boolean vs. Vector-space)

[Plot: F1-measure vs. specificity threshold Ts (0–1) for QP-RIPPER and QP-SVM, under Boolean and vector-space retrieval models]
CORI – Stemming

[Plots: R(k) vs. k (databases selected, 1–20) on the TREC4 and TREC6 testbeds, for QBS and FPS sampling, each comparing Shrinkage, Hierarchical, and Plain]
bGlOSS – Stemming

[Plots: same layout as above (TREC4/TREC6 × QBS/FPS; Shrinkage vs. Hierarchical vs. Plain)]
LM – Stemming

[Plots: same layout as above (TREC4/TREC6 × QBS/FPS; Shrinkage vs. Hierarchical vs. Plain)]
CORI – No Stemming

[Plots: same layout as above (TREC4/TREC6 × QBS/FPS; Shrinkage vs. Hierarchical vs. Plain)]
bGlOSS – No Stemming

[Plots: same layout as above (TREC4/TREC6 × QBS/FPS; Shrinkage vs. Hierarchical vs. Plain)]
LM – No Stemming

[Plots: same layout as above (TREC4/TREC6 × QBS/FPS; Shrinkage vs. Hierarchical vs. Plain)]
Frequency Estimation – TREC4 – CORI

[Plots: R(k) vs. k (1–20), with and without stemming, for QBS and FPS, comparing Shrinkage/Plain crossed with and without frequency estimation (FreqEst/NoFreqEst)]
Frequency Estimation – TREC6 – CORI

[Plots: same layout as above (with/without stemming × QBS/FPS; Shrinkage/Plain × FreqEst/NoFreqEst)]
Universal Application of Shrinkage – TREC4 – CORI

[Plots: R(k) vs. k (1–20) for QBS and FPS, comparing Plain, Universal (always-on) shrinkage, and adaptive Shrinkage]
Universal Application of Shrinkage – TREC4 – bGlOSS

[Plots: same layout as above (QBS/FPS; Plain vs. Universal vs. adaptive Shrinkage)]
Results: Content Summary Quality

 Recall: How many words in the database are also in the summary?
  Shrinkage-based summaries include 10–90% more words than unshrunk summaries
 Precision: How many words in the summary are also in the database?
  Shrinkage-based summaries include 5–15% words not in the actual database

[Bar charts: recall and precision of Shrinkage vs. No Shrinkage summaries on the Web, TREC4, and TREC6 data sets]
Results: Content Summary Quality (cont.)

 Rank correlation: Is the word ranking in the summary similar to the ranking in the database?
  Shrinkage-based summaries demonstrate better word rankings than unshrunk summaries

[Bar chart: rank correlation of Shrinkage vs. No Shrinkage summaries on the Web, TREC4, and TREC6 data sets]

 Kullback-Leibler divergence: Is the probability distribution in the summary similar to the distribution in the database?
  Shrinkage improves the bad cases, while making some very good ones worse
  → Motivates the adaptive application of shrinkage!
Model: Querying Graph

[Figure: bipartite querying graph between words (t1…t5) and documents (d1…d5)]
Model: Reachability Graph

[Figure: the bipartite querying graph and the reachability graph derived from it over the words (t1…t5); e.g., t1 retrieves document d1, which contains t2, so t2 is reachable from t1]