Scalable Methods for Estimating Document Frequencies of Collocations in Databases
Tan Yee Fan
2006 December 15
WING Group Meeting

Motivation
Input:
  A list of items, L
Output:
  For each a, b in L, the cooccurrence value between a and b
Often done by querying some database for document frequencies
  e.g. f(a), f(b) and f(a ∧ b)
Many cooccurrence measures need f(a ∧ b)
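
As an illustration of how a cooccurrence value can be derived from the three document frequencies, here is a minimal sketch using the Jaccard coefficient. The choice of measure is an assumption made for illustration (the slides do not name one), and the function name and the numbers are hypothetical.

```python
def jaccard(f_a: int, f_b: int, f_ab: int) -> float:
    """Jaccard coefficient from document frequencies.

    f_a, f_b: number of documents containing a, respectively b.
    f_ab:     number of documents containing both a and b.
    """
    union = f_a + f_b - f_ab  # documents containing a or b
    return f_ab / union if union > 0 else 0.0


# Made-up frequencies, for illustration only.
print(jaccard(120, 80, 40))  # 0.25
```

Whatever the measure, the joint frequency is the expensive input: it has to be obtained per pair rather than per item.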

Problem Statement
Input
  A list of items, L
Output
  For each a, b in L, the document frequency f(a ∧ b) in database
Naïve pairwise algorithm needs O(n²) queries
  Not scalable (e.g. n ~ 1000)
  Bandwidth issues and server overload
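
To make the scale concrete, the query count of the pairwise approach can be worked out directly. This small sketch assumes one query per unordered pair of distinct items; the function name is hypothetical.

```python
from math import comb


def pairwise_queries(n: int) -> int:
    """Queries needed when every unordered pair of n items is queried once."""
    return comb(n, 2)  # n * (n - 1) / 2


print(pairwise_queries(1000))  # 499500
```

For a list of about 1000 items this is roughly half a million queries against the database.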

Related Work
C-PANKOW (Cimiano et al., 2005)
  Matching named entities to concepts
POLYPHONET (Matsuo et al., 2006)
  Building a social network
Avoid pairwise queries as far as possible
  Both C-PANKOW and POLYPHONET perform “document processing” to achieve this goal
  Is document processing really necessary?

Related Work
QProber (Ipeirotis, 2002)
  Obtain a sample of documents from database
  Select some words to query and fit a power law curve
  Estimate document frequencies of the rest
Figure from Ipeirotis (2002)

This Project
Extend QProber algorithm to collocations
Algorithm
  Obtain a sample of documents from database
  Select some collocations to query and fit a power law curve
  Estimate document frequencies of the rest
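
The fit-and-estimate steps can be sketched as follows. This is a simplified illustration, not the exact procedure of QProber or of this project: it assumes a plain power law f ≈ c · r^(−b) between a collocation's rank r in the sample and its document frequency f in the database, fitted by least squares in log-log space. The function names and the numbers are hypothetical.

```python
import math


def fit_power_law(ranks, freqs):
    """Least-squares fit of log f = log c - b * log r; returns (c, b)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct ranks
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def estimate(rank, c, b):
    """Estimated database document frequency at a given sample rank."""
    return c * rank ** (-b)


# Queried collocations: sample rank -> document frequency returned by the
# database (made-up numbers lying exactly on f = 1000 / r).
queried = {1: 1000.0, 4: 250.0, 10: 100.0}
c, b = fit_power_law(list(queried), list(queried.values()))

# Estimate an unqueried collocation that sits at rank 5 in the sample.
print(round(estimate(5, c, b), 2))
```

Only the collocations in `queried` cost a database query; every other collocation is read off the fitted curve.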

Query Selection Strategy
For each word w, order collocations in sampled documents containing w by rank
Uniformly select q collocations to query, by decreasing rank
Use O(qn) queries, with q << n
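
One possible reading of the selection step is sketched below: for a given word, take the rank-ordered list of its collocations from the sample and pick q of them at evenly spaced positions, from the top rank to the bottom rank. Treating "uniformly" as even spacing is an assumption, and the function name and data are hypothetical.

```python
def select_queries(ranked, q):
    """Pick q items at evenly spaced positions of a rank-ordered list."""
    n = len(ranked)
    if q <= 0 or n == 0:
        return []
    if q >= n:
        return list(ranked)  # fewer candidates than the budget: query them all
    if q == 1:
        return [ranked[0]]
    # Integer arithmetic keeps positions distinct and within 0 .. n - 1.
    positions = [(i * (n - 1)) // (q - 1) for i in range(q)]
    return [ranked[p] for p in positions]


# Ten collocations of one word, already ordered by rank (hypothetical names).
ranked = [f"c{i}" for i in range(10)]
print(select_queries(ranked, 4))  # ['c0', 'c3', 'c6', 'c9']
```

Repeating this for each of the n words gives at most q queries per word, which is where the O(qn) total comes from.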

Experiment
Database of 2000 newsgroup articles
Evaluated on a lexicon of 100 words
Vary sample size s and number of queries q

Conclusion
Possible to estimate document frequencies of collocations reliably using O(n) queries
Next step
  Can the methods be applied to disambiguating author names, publication venue titles, etc.?

Additional Slides

Estimating Actual Document Frequencies
Alternative method
  For each word w, fit a power law curve using the collocations containing w
  Estimation for unknown collocation w1 w2:
    Average the values estimated from the curve of w1 and the curve of w2
Problem
  Quality of each curve is not as good, as fewer training examples are used
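
The averaging step of the alternative method can be sketched as below. The sketch assumes each per-word curve is a plain power law stored as a pair (c, b) with f ≈ c · r^(−b), that the collocation has a rank in each word's own list, and that "average" means the arithmetic mean. All names and numbers are hypothetical.

```python
def estimate_from_curve(curve, rank):
    """Evaluate a per-word power law curve (c, b) at the given rank."""
    c, b = curve
    return c * rank ** (-b)


def estimate_collocation(curve_w1, rank_in_w1, curve_w2, rank_in_w2):
    """Arithmetic mean of the estimates from the curves of w1 and w2."""
    est_w1 = estimate_from_curve(curve_w1, rank_in_w1)
    est_w2 = estimate_from_curve(curve_w2, rank_in_w2)
    return 0.5 * (est_w1 + est_w2)


# Made-up curves: the collocation is rank 4 among the collocations of w1
# and rank 2 among the collocations of w2.
print(estimate_collocation((1000.0, 1.0), 4, (600.0, 1.0), 2))  # 275.0
```

Each curve here is fitted from the collocations of a single word only, so it rests on far fewer data points than a single pooled curve.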

Query Selection Strategy
Alternative strategy
  Uniform selection of collocations to query without regard to frequencies
Problem
  Together with alternative method, can produce large errors due to selection of collocations at the tail of the power law curve to query