Download Scalable Methods for Estimating Document Frequencies

Scalable Methods for Estimating Document Frequencies of Collocations in Databases Tan Yee Fan 2006 December 15 WING Group Meeting Motivation  Input:   Output:   A list of items, L For each a, b in L, the cooccurrence value between a and b Often done by querying some database for document frequencies  e.g. f(a), f(b) and f(a  b)  Many cooccurrence measures need f(a  b) Problem Statement  Input   Output   A list of items, L For each a, b in L, the document frequency f(a  b) in database Naïve pairwise algorithm need O(n2) queries   Not scalable (e.g. n ~ 1000) Bandwidth issues and server overload Related Work  C-PANKOW (Cimiano et al., 2005)   POLYPHONET (Matsuo et al., 2006)   Matching named entities to concepts Building a social network Avoid pairwise queries as far as possible   Both C-PANKOW and POLYPHONET perform “document processing” to achieve this goal Is document processing really necessary? Related Work  QProber (Ipeirotis, 2002)    Obtain a sample of documents from database Select some words to query and fit a power law curve Estimate document frequencies of the rest Figure from Ipeirotis (2002) This Project  Extend QProber algorithm to collocations  Algorithm    Obtain a sample of documents from database Select some collocations to query and fit a power law curve Estimate document frequencies of the rest Query Selection Strategy  Query selection strategy   For each word w, order collocations in sampled documents containing w by rank Uniformly select q collocations to query decreasing rank  Use O(qn) queries, with q << n Experiment    Database of 2000 newsgroup articles Evaluated on a lexicon of 100 words Vary sample size s and number of queries q Conclusion  Possible to estimate document frequencies of collocations reliably using O(n) queries  Next step  Can the methods be applied to disambiguating author names, publication venue titles, etc.? Additional Slides Estimating Actual Document Frequencies  Alternative method   For each word w, fit a power law curve using the collocations containing w Estimation for unknown collocation w1 w2:   Average the values estimated from the curve of w1 and the curve of w2 Problem  Quality of each curve is not as good as lesser training examples used Query Selection Strategy  Alternative strategy  Uniform selection of collocations to query without regard to frequencies  Problem  Together with alternative method, can produce large errors due to selection of collocations at the tail of the power law curve to query

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Scalable Methods for Estimating Document Frequencies