Bayesian Extension to the Language Model for Ad Hoc Information Retrieval
Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping
Presented by Chen Yi-Ting
Outline
- Introduction
- The Unigram Query Model
- Bayesian Language Model
- Relationship to other smoothing models
- Empirical evaluation
- Conclusion
Introduction
- To propose a Bayesian extension of the language model.
- Bayesian statistics provide a powerful yet intuitive mathematical framework for data modeling when the data is scarce and/or uncertain and there is some prior knowledge about it.
- To derive analytically the predictive distribution of the most commonly used query language model.
The Unigram Query Model
- The specific form of language model: the query unigram model.
- Consider a query q and a document collection of N documents, C := {d_l}_{l=1,...,N}, both queries and documents being represented as vectors of indexed term counts:
  q := (q_1, \dots, q_i, \dots, q_V) \in \mathbb{N}^V
  d_l := (d_{l,1}, \dots, d_{l,i}, \dots, d_{l,V}) \in \mathbb{N}^V
- Furthermore, consider a multinomial generation model for each document:
  \theta_l := (\theta_{l,1}, \dots, \theta_{l,i}, \dots, \theta_{l,V}) \in [0,1]^V, \quad \sum_{i=1}^{V} \theta_{l,i} = 1
The Unigram Query Model
- Finally, let us also define the length of a query (n_q) and of a document (n_l) as the sum of their components.
- Under this model, the probability of generating a particular query with counts q is given by the product:
  P(q \mid \theta_l) = Z_q^{-1} \prod_{i=1}^{V} (\theta_{l,i})^{q_i}, \quad Z_q^{-1} = \frac{n_q!}{\prod_{i=1}^{V} q_i!}
- Similarly for documents:
  P(d_l \mid \theta_l) = Z_{d_l}^{-1} \prod_{i=1}^{V} (\theta_{l,i})^{d_{l,i}}, \quad Z_{d_l}^{-1} = \frac{n_l!}{\prod_{i=1}^{V} d_{l,i}!}
The Unigram Query Model
- The unigram query model postulates that the relevance of a document to a query can be measured by the probability that the query is generated by the document: \theta_l \leftarrow d_l.
- Given an infinite amount of data, the parameters could easily be estimated by their empirical estimates:
  \hat{\theta}_{l,i} = \frac{d_{l,i}}{n_l}
- Problems:
  - Underestimating the probability of rare words and overestimating the probability of frequent ones.
  - Unseen words. → Smoothed estimates.
Bayesian Language Model
- The problem of small data samples and resulting parameter uncertainty suggests the use of Bayesian techniques.
- Bayes' rule:
  P(\theta_l \mid d_l) = \frac{P(\theta_l)\, P(d_l \mid \theta_l)}{P(d_l)}
- A more powerful approach is to take account of posterior uncertainty when evaluating the probability of q by computing the predictive distribution:
  P(q \mid d_l) = \int P(q \mid \theta_l)\, P(\theta_l \mid d_l)\, d\theta_l
- In the case of abundant data, it can be seen that P(q \mid d_l) \approx P(q \mid \hat{\theta}_l).
Bayesian Language Model
- The choice of prior probability distributions is central to Bayesian inference.
- In most cases the only available choice for a prior is the natural conjugate of the generating distribution.
- The natural conjugate of a multinomial distribution is the Dirichlet distribution:
  P(\theta) = Z_\alpha \prod_{i=1}^{V} (\theta_i)^{\alpha_i - 1}, \quad Z_\alpha = \frac{\Gamma(n_\alpha)}{\prod_{i=1}^{V} \Gamma(\alpha_i)}, \quad n_\alpha := \sum_{i=1}^{V} \alpha_i
- The posterior distribution is Dirichlet as well:
  P(\theta_l \mid d_l) = \frac{\Gamma(n_l + n_\alpha)}{\prod_{i=1}^{V} \Gamma(d_{l,i} + \alpha_i)} \prod_{i=1}^{V} (\theta_{l,i})^{d_{l,i} + \alpha_i - 1}
Bayesian Language Model
- Finally, we can compute the predictive distribution in (7) in closed form (a standard form of the result is reproduced below).
- The scoring function is obtained by taking the log and separating out the terms in the query not appearing in the document.
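A standard closed form of this Dirichlet-multinomial predictive, which should coincide (up to notation) with equation (7) of the paper, follows directly from the prior and posterior above:

P(q \mid d_l) = \int P(q \mid \theta_l)\, P(\theta_l \mid d_l)\, d\theta_l
             = Z_q^{-1}\, \frac{\Gamma(n_l + n_\alpha)}{\Gamma(n_l + n_\alpha + n_q)}\, \prod_{i=1}^{V} \frac{\Gamma(d_{l,i} + \alpha_i + q_i)}{\Gamma(d_{l,i} + \alpha_i)}

Only terms with q_i > 0 contribute non-trivial factors, which is what makes the log-separation into matching and non-matching query terms possible.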
Bayesian Language Model
- Setting \alpha:
  - A better option is to fit the prior distribution to the collection statistics, since these are known.
  - In the absence of any other information, we will assume that the documents resemble the average document.
  - The average relative count of term v_i in the collection is
    P(v_i \mid C) := \frac{\sum_l d_{l,i}}{\sum_l n_l}
  - The mean of the prior distribution in (3) is known to be \alpha_i / n_\alpha; requiring it to equal P(v_i \mid C) can be satisfied by setting \alpha_i = \beta\, P(v_i \mid C) and n_\alpha = \beta.
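A minimal sketch of fitting the prior to collection statistics (an assumed list-of-lists representation; beta is the free prior-strength parameter):

def collection_prior(doc_count_vectors, beta):
    # alpha_i = beta * P(v_i | C), with P(v_i | C) = sum_l d_{l,i} / sum_l n_l,
    # so that the prior mean is the average document and n_alpha = beta.
    V = len(doc_count_vectors[0])
    term_totals = [sum(doc[i] for doc in doc_count_vectors) for i in range(V)]
    total = sum(term_totals)
    p_collection = [t / total for t in term_totals]
    return [beta * p for p in p_collection]

docs = [[4, 1, 0], [2, 2, 1]]                 # toy collection, V = 3
print(collection_prior(docs, beta=2.0))       # approximately [1.2, 0.6, 0.2]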
Relationship to other smoothing models
- A standard approximation to the Bayesian predictive distribution (7) is the so-called maximum posterior (MP) distribution:
  P(q \mid d_l) \approx P(q \mid \theta_l^{MP}) = Z_q^{-1} \prod_{i=1}^{V} (\theta_{l,i}^{MP})^{q_i}
- For a Dirichlet prior, the maximum posterior estimate is known analytically:
  \theta_{l,i}^{MP} = \frac{d_{l,i} + \alpha_i - 1}{n_l + n_\alpha - V}
- Setting \alpha_i = 1 gives the maximum likelihood estimator:
  \theta_{l,i}^{ML} = \frac{d_{l,i}}{n_l}
Relationship to other smoothing models

Setting  i =2 or  i =   1
estimators:



LA1
l ,i

d l ,i  1
, 
 Laplace smoothing
LA
l ,i

d l ,i  
nl  V
nl   V
Setting  i =  P(vi | C )  Bayes-smoothing estimate:
dl ,i   P(vi | C )
BS
 l ,i 
nl  
These different smoothed estimators are approximations
to the full estimator obtained by 1) replacing the
predictive distribution by the maximum posterior
distribution. And 2) choosing a particular value of 
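The following sketch (an illustration of the relationships above, not the paper's code) evaluates the maximum posterior estimate for the listed choices of alpha and recovers each smoothed estimator; the numerical values are hypothetical:

def theta_mp(doc_counts, alpha):
    # theta^MP_{l,i} = (d_{l,i} + alpha_i - 1) / (n_l + n_alpha - V)
    n_l, n_alpha, V = sum(doc_counts), sum(alpha), len(alpha)
    return [(d + a - 1) / (n_l + n_alpha - V) for d, a in zip(doc_counts, alpha)]

doc = [4, 1, 0]                               # d_{l,i}, so n_l = 5
p_coll = [0.6, 0.3, 0.1]                      # P(v_i | C), e.g. from the toy collection above
beta, lam = 2.0, 0.5                          # hypothetical smoothing parameters

ml      = theta_mp(doc, [1.0] * 3)                          # maximum likelihood
laplace = theta_mp(doc, [2.0] * 3)                          # Laplace (add-one)
gen_lap = theta_mp(doc, [lam + 1.0] * 3)                    # generalized Laplace
bayes_s = theta_mp(doc, [1.0 + beta * p for p in p_coll])   # Bayes-smoothing
print(ml, laplace, gen_lap, bayes_s)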
Relationship to other smoothing models
- The linear interpolation (LI) smoothing:
  \theta_{l,i}^{LI} = \lambda\, \theta_{l,i}^{ML} + (1 - \lambda)\, P(v_i \mid C)
- We first note that in the case of BS and LI the probability of an unseen word can be written as \lambda_l\, P(v_i \mid C), where \lambda_l is a document-dependent constant and P(v_i \mid C) was previously defined.
- This allows the unigram query model (3) to be rewritten so that only the query terms matching the document need to be accumulated (a standard form of the rewrite is given below).
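A standard rank-equivalent rewrite consistent with the unseen-word form above (the exact grouping in the paper may differ) is:

\log P(q \mid d_l) = \sum_{i:\, q_i > 0,\ d_{l,i} > 0} q_i \log\frac{\theta_{l,i}}{\lambda_l\, P(v_i \mid C)} \;+\; n_q \log \lambda_l \;+\; \sum_{i} q_i \log P(v_i \mid C) \;+\; \log Z_q^{-1}

The last two terms do not depend on the document, so at query time only the terms matching the document, plus the document-dependent length term n_q log \lambda_l, need to be evaluated.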
Relationship to other smoothing models
- The Bayesian predictive model proposed in this paper can be put in an analogous form (see the sketch below):
  - The number of operations needed to compute the first term now depends on the number of matching terms.
  - The last term cannot be pre-computed since it depends on the query, but its computational cost remains negligible.
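A minimal sketch of such a scoring function, derived from the closed-form predictive given earlier (my grouping of terms; the paper's exact formulation may differ). Document-independent constants are dropped, since they do not affect ranking:

from math import lgamma

def bayesian_predictive_score(query_counts, doc_counts, alpha):
    # Rank-equivalent log P(q | d_l): a sum over terms present in both the
    # query and the document, plus a single query/document-length term.
    n_d = sum(doc_counts.values())
    n_q = sum(query_counts.values())
    n_alpha = sum(alpha.values())
    score = 0.0
    for i, q_i in query_counts.items():
        d_i = doc_counts.get(i, 0)
        if d_i > 0:                                   # first term: matching terms only
            a_i = alpha[i]
            score += (lgamma(d_i + a_i + q_i) - lgamma(d_i + a_i)
                      - lgamma(a_i + q_i) + lgamma(a_i))
    # last term: depends on the query, but is a single cheap evaluation
    score += lgamma(n_d + n_alpha) - lgamma(n_d + n_alpha + n_q)
    return score

alpha = {0: 1.2, 1: 0.6, 2: 0.2}                      # e.g. beta * P(v_i | C)
print(bayesian_predictive_score({0: 2, 1: 1}, {0: 4, 1: 1}, alpha))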
Empirical evaluation
- TREC-8 document collection.
- TREC-6 and TREC-8 queries and query relevance sets.
- Data pre-processing was standard:
  - Terms were stemmed.
  - Stop words were removed, as well as words appearing fewer than 3 times.
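A minimal sketch of the frequency-based pruning step in pure Python (the stop-word list and the choice of stemmer are assumptions, since the slides do not name them, and stemming is assumed to happen beforehand):

from collections import Counter

def prune_vocabulary(tokenized_docs, stop_words, min_count=3):
    # Drop stop words and words appearing fewer than `min_count` times overall.
    counts = Counter(t for doc in tokenized_docs for t in doc)
    keep = {t for t, c in counts.items() if c >= min_count and t not in stop_words}
    return [[t for t in doc if t in keep] for doc in tokenized_docs]

docs = [["model", "model", "language", "the"],
        ["model", "language", "language", "of"]]
print(prune_vocabulary(docs, stop_words={"the", "of"}, min_count=3))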
Conclusion
- We proposed a new scoring function derived from the Bayesian predictive distribution, which many smoothed estimators approximate.
- We have shown that its computational cost is only slightly greater than that of other existing estimators.
- The new model performs better than Bayes-smoothing but not as well as linear interpolation smoothing.