Information Retrieval
Chap. 02: Modeling - Part 2
Slides from the textbook author, modified by L N Cassel
September 2003
Probabilistic Model
• Objective: to capture the IR problem using a probabilistic framework
• Given a user query, there is an ideal answer set
• Querying as specification of the properties of this ideal answer set (clustering)
• But what are these properties?
• Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
• Improve by iteration
Probabilistic Model
• An initial set of documents is retrieved somehow
• The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
• The IR system uses this information to refine the description of the ideal answer set
• By repeating this process, it is expected that the description of the ideal answer set will improve
• Keep in mind the need to guess, at the very beginning, the description of the ideal answer set
• The description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle
• Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant).
• The model assumes that this probability of relevance depends on the query and the document representations only.
• The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
• But,
  – how to compute the probabilities?
  – what is the sample space?
The Ranking
• The probabilistic ranking is computed as:
  sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
• This is the odds of the document dj being relevant. Taking the odds minimizes the probability of an erroneous judgement.
• Definitions:
  – $w_{ij} \in \{0,1\}$
  – $P(R \mid \vec{d}_j)$: probability that the given document is relevant
  – $P(\bar{R} \mid \vec{d}_j)$: probability that the document is not relevant
The Ranking

sim(dj,q) = P(R | dj) / P(R | dj)
= [P( dj | R) * P(R)]
[P( dj | R) * P(R)]
~


P( dj | R)
P( dj | R)
P( dj | R) : probability of randomly selecting the
document dj from the set R of relevant documents
P(R): probability that a document selected at random
from the whole set of documents is relevant. P(R)
and P(R) are the same.
The Ranking
Assuming independence of the index terms:
$$\mathrm{sim}(d_j,q) \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})} \sim \frac{\Big[\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R)\Big]\Big[\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)\Big]}{\Big[\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R})\Big]\Big[\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})\Big]}$$
• $P(k_i \mid R)$: probability that the index term ki is present in a document randomly selected from the set R of relevant documents
• $P(\bar{k}_i \mid R)$: probability that ki is not present in such a document
The Ranking
Taking logarithms and dropping factors that are constant for all documents:
$$\mathrm{sim}(d_j,q) \sim \log \frac{\Big[\prod P(k_i \mid R)\Big]\Big[\prod P(\bar{k}_i \mid R)\Big]}{\Big[\prod P(k_i \mid \bar{R})\Big]\Big[\prod P(\bar{k}_i \mid \bar{R})\Big]} \sim \sum_i w_{iq}\, w_{ij} \left( \log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$
where
$$P(\bar{k}_i \mid R) = 1 - P(k_i \mid R) \qquad P(\bar{k}_i \mid \bar{R}) = 1 - P(k_i \mid \bar{R})$$
The Initial Ranking
$$\mathrm{sim}(d_j,q) \sim \sum_i w_{iq}\, w_{ij} \left( \log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$
• How do we obtain the probabilities $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$?
• Estimates based on initial assumptions:
  – $P(k_i \mid R) = 0.5$
  – $P(k_i \mid \bar{R}) = n_i / N$, where ni is the number of docs that contain ki and N is the total number of docs
• Use this initial guess to retrieve an initial ranking
• Improve upon this initial ranking
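To make the initial ranking concrete, here is a minimal Python sketch (not from the slides): a hypothetical four-document corpus, binary weights wiq = wij = 1, and the two initial estimates plugged into the sim formula. Note that with P(ki|R) = 0.5 the first log term vanishes, so the initial ranking is driven entirely by the idf-like second term.

```python
import math

# Hypothetical toy corpus: each document is a set of index terms.
docs = {
    "d1": {"gold", "silver", "truck"},
    "d2": {"shipment", "gold", "fire"},
    "d3": {"shipment", "truck", "arrived"},
    "d4": {"shipment", "fire", "arrived"},
}
query = {"gold", "silver"}
N = len(docs)

# ni: number of documents containing term ki
n = {}
for terms in docs.values():
    for k in terms:
        n[k] = n.get(k, 0) + 1

def initial_sim(doc_terms, query_terms):
    """sim(dj,q) under the initial estimates P(ki|R) = 0.5, P(ki|~R) = ni/N."""
    sim = 0.0
    for k in query_terms & doc_terms:      # wiq = wij = 1 only for shared terms
        p_rel = 0.5                        # P(ki | R): constant for all terms
        p_nonrel = n[k] / N                # P(ki | ~R): approximated by ni/N
        sim += math.log(p_rel / (1 - p_rel))          # = 0 on the first pass
        sim += math.log((1 - p_nonrel) / p_nonrel)    # idf-like contribution
    return sim

ranking = sorted(docs, key=lambda d: initial_sim(docs[d], query), reverse=True)
print(ranking)   # d1 first: it contains both query terms, one of them rare
```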
Improving the Initial Ranking
$$\mathrm{sim}(d_j,q) \sim \sum_i w_{iq}\, w_{ij} \left( \log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$
• Let
  – V: set of docs initially retrieved
  – Vi: subset of the retrieved docs that contain ki
• Re-evaluate the estimates (using V and Vi also for the set sizes):
  – $P(k_i \mid R) = V_i / V$
  – $P(k_i \mid \bar{R}) = (n_i - V_i) / (N - V)$
• Repeat recursively
Improving the Initial Ranking
$$\mathrm{sim}(d_j,q) \sim \sum_i w_{iq}\, w_{ij} \left( \log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$
• To avoid problems with small values such as V = 1 and Vi = 0:
  – $P(k_i \mid R) = (V_i + 0.5) / (V + 1)$
  – $P(k_i \mid \bar{R}) = (n_i - V_i + 0.5) / (N - V + 1)$
• Alternatively:
  – $P(k_i \mid R) = (V_i + n_i/N) / (V + 1)$
  – $P(k_i \mid \bar{R}) = (n_i - V_i + n_i/N) / (N - V + 1)$
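Continuing the sketch above, one feedback iteration with the adjusted (smoothed) estimates might look as follows; V is taken to be the top-ranked documents from the previous round, and the cutoff of 2 is arbitrary.

```python
def feedback_sim(doc_terms, query_terms, V_docs):
    """One iteration: re-estimate P(ki|R) and P(ki|~R) from the retrieved set V,
    using the smoothed estimates (Vi + 0.5)/(V + 1) and (ni - Vi + 0.5)/(N - V + 1)."""
    V = len(V_docs)
    sim = 0.0
    for k in query_terms & doc_terms:
        Vi = sum(1 for terms in V_docs if k in terms)   # retrieved docs containing ki
        p_rel = (Vi + 0.5) / (V + 1)                    # P(ki | R)
        p_nonrel = (n[k] - Vi + 0.5) / (N - V + 1)      # P(ki | ~R)
        sim += math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)
    return sim

# Treat the top-2 docs of the previous ranking as V, then re-rank; repeating
# this loop implements the recursive improvement described above.
V_docs = [docs[d] for d in ranking[:2]]
reranked = sorted(docs, key=lambda d: feedback_sim(docs[d], query, V_docs), reverse=True)
```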
Pluses and Minuses
• Advantages:
  – docs are ranked in decreasing order of their probability of relevance
• Disadvantages:
  – need to guess the initial estimates for $P(k_i \mid R)$
  – the method does not take tf and idf factors into account
Brief Comparison of Classic Models
• The Boolean model does not provide for partial matches and is considered the weakest classic model
• Salton and Buckley ran a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections
• This also seems to be the view of the research community
Set Theoretic Models
• The Boolean model imposes a binary criterion for deciding relevance
• The question of how to extend the Boolean model to accommodate partial matching and a ranking has attracted considerable attention in the past
• We now discuss two set theoretic models for this:
  – Fuzzy Set Model
  – Extended Boolean Model (not discussed here because of time limitations)
Fuzzy Set Model
• Queries and docs are represented by sets of index terms: matching is approximate from the start
• This vagueness can be modeled using a fuzzy framework, as follows:
  – with each term is associated a fuzzy set
  – each doc has a degree of membership in this fuzzy set
• This interpretation provides the foundation for many IR models based on fuzzy theory
• Here we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
Fuzzy Set Theory
• A framework for representing classes whose boundaries are not well defined
• The key idea is to introduce the notion of a degree of membership associated with the elements of a set
• This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
• Thus, membership is now a gradual notion, in contrast to the crisp notion enforced by classic Boolean logic
Fuzzy Set Theory
• Definition: a fuzzy subset A of U is characterized by a membership function
  $$\mu_A : U \to [0,1]$$
  which associates with each element u of U a number $\mu_A(u)$ in the interval [0,1]
• Definition: let A and B be two fuzzy subsets of U, and let ¬A be the complement of A. Then,
  $$\mu_{\neg A}(u) = 1 - \mu_A(u)$$
  $$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$$
  $$\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$$
Fuzzy Information Retrieval
• Fuzzy sets are modeled based on a thesaurus
• This thesaurus is built as follows:
  – let c be a term-term correlation matrix
  – let c(i,l) be a normalized correlation factor for the pair (ki,kl):
    $$c(i,l) = \frac{n(i,l)}{n_i + n_l - n(i,l)}$$
    where
    • ni: number of docs which contain ki
    • nl: number of docs which contain kl
    • n(i,l): number of docs which contain both ki and kl
• We now have a notion of proximity among index terms.
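The correlation factor is just the Jaccard overlap of the two terms' document sets; a sketch over a made-up three-document corpus:

```python
# c(i,l) = n(i,l) / (ni + nl - n(i,l)): the Jaccard overlap of the document
# sets of terms ki and kl. The corpus is hypothetical.
docs = {
    "d1": {"gold", "silver", "truck"},
    "d2": {"shipment", "gold", "fire"},
    "d3": {"shipment", "truck", "arrived"},
}

def c(ki, kl):
    """Normalized correlation factor c(i,l)."""
    Di = {d for d, terms in docs.items() if ki in terms}   # docs containing ki
    Dl = {d for d, terms in docs.items() if kl in terms}   # docs containing kl
    n_il = len(Di & Dl)                                    # docs containing both
    return n_il / (len(Di) + len(Dl) - n_il)

print(c("gold", "truck"))   # 1/3: one doc has both, three docs have either
```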
Fuzzy Information Retrieval
• The correlation factor c(i,l) can be used to define fuzzy set membership for a document dj as follows:
  $$\mu(i,j) = 1 - \prod_{k_l \in d_j} (1 - c(i,l))$$
  where $\mu(i,j)$ is the membership of doc dj in the fuzzy subset associated with ki
• The above expression computes an algebraic sum over all terms in the doc dj (expressed here as the complement of a negated algebraic product)
• A doc dj belongs to the fuzzy set for ki if its own terms are associated with ki
Fuzzy Information Retrieval
$$\mu(i,j) = 1 - \prod_{k_l \in d_j} (1 - c(i,l))$$
• $\mu(i,j)$: membership of doc dj in the fuzzy subset associated with ki
• If doc dj contains a term kl which is closely related to ki, we have
  – $c(i,l) \approx 1$ (the correlation between terms ki and kl is high)
  – $\mu(i,j) \approx 1$ (the membership of dj in the fuzzy set of ki is high)
  – and the index ki is a good fuzzy index for document dj.
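Continuing the sketch above, the membership can be computed as an algebraic sum over the document's own terms:

```python
def mu(ki, dj_terms):
    """mu(i,j) = 1 - prod over kl in dj of (1 - c(i,l)): an algebraic sum."""
    value = 1.0
    for kl in dj_terms:
        value *= 1 - c(ki, kl)
    return 1 - value

# d3 does not contain "gold", but "gold" co-occurs with "shipment" and
# "truck", so d3 still has a nonzero membership in the fuzzy set of "gold":
print(mu("gold", docs["d3"]))   # 1 - (2/3)*(2/3)*1 = 5/9 ~ 0.556
```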
Fuzzy IR: An Example
[Figure: Venn diagram of the index-term sets Ka, Kb, Kc, with the conjunctive components cc1 = (1,1,1), cc2 = (1,1,0), and cc3 = (1,0,0) marked as regions.]
• $q = k_a \wedge (k_b \vee \neg k_c)$
• $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0) = cc_1 + cc_2 + cc_3$
$$\mu(q,j) = \mu(cc_1 + cc_2 + cc_3, j) = 1 - \big(1 - \mu(a,j)\,\mu(b,j)\,\mu(c,j)\big) \big(1 - \mu(a,j)\,\mu(b,j)\,(1-\mu(c,j))\big) \big(1 - \mu(a,j)\,(1-\mu(b,j))\,(1-\mu(c,j))\big)$$
Exercise: put some numbers in and calculate the actual value of $\mu(q,d_j)$; one worked instance follows.
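One possible instance of the exercise in Python, with made-up membership values for dj (the numbers are illustrative only):

```python
# Hypothetical memberships mu(a,j), mu(b,j), mu(c,j) for some document dj:
mu_a, mu_b, mu_c = 0.8, 0.6, 0.3

mu_cc1 = mu_a * mu_b * mu_c               # (1,1,1): ka ^ kb ^ kc
mu_cc2 = mu_a * mu_b * (1 - mu_c)         # (1,1,0): ka ^ kb ^ ~kc
mu_cc3 = mu_a * (1 - mu_b) * (1 - mu_c)   # (1,0,0): ka ^ ~kb ^ ~kc

# Algebraic sum of the three conjunctive components:
mu_q = 1 - (1 - mu_cc1) * (1 - mu_cc2) * (1 - mu_cc3)
print(round(mu_q, 4))   # 0.5589
```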
Fuzzy Information Retrieval
• Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
• Experiments with standard test collections are not available
• This makes them difficult to compare with other models at this time
Alternative Probabilistic Models
• Probability Theory
  – Semantically clear
  – Computationally clumsy
• Why Bayesian Networks?
  – A clear formalism for combining evidence
  – Modularizes the world (dependencies)
• Bayesian Network Models for IR
  – Inference Network (Turtle & Croft, 1991)
  – Belief Network (Ribeiro-Neto & Muntz, 1996)
Bayesian Inference
Basic axioms:
• $0 \le P(A) \le 1$
• $P(\text{sure}) = 1$
• $P(A \vee B) = P(A) + P(B)$ if A and B are mutually exclusive
Bayesian Inference
Other formulations:
• $P(A) = P(A \wedge B) + P(A \wedge \neg B)$
• $P(A) = \sum_i P(A \wedge B_i)$, where the $B_i$ form a set of exhaustive and mutually exclusive events
• $P(A) + P(\neg A) = 1$
• $P(A \mid K)$: belief in A given the knowledge K
• if $P(A \mid B) = P(A)$, we say A and B are independent
• if $P(A \mid B \wedge C) = P(A \mid C)$, we say A and B are conditionally independent, given C
• $P(A \wedge B) = P(A \mid B)\,P(B)$
• $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$
Bayesian Inference
Bayes' rule, the heart of Bayesian techniques:
$$P(H \mid e) = \frac{P(e \mid H)\,P(H)}{P(e)}$$
where
• H is a hypothesis and e is an evidence
• $P(H)$: prior probability
• $P(H \mid e)$: posterior probability
• $P(e \mid H)$: probability of e if H is true
• $P(e)$: a normalizing constant; treating it as such, we write $P(H \mid e) \sim P(e \mid H)\,P(H)$
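A small numeric instance of the rule (all numbers are made up):

```python
# H = "the document is relevant"; e = "the document contains the query term".
p_H = 0.1            # prior P(H)
p_e_given_H = 0.8    # P(e | H)
p_e = 0.25           # P(e), the normalizing constant

p_H_given_e = p_e_given_H * p_H / p_e   # posterior P(H | e)
print(p_H_given_e)                      # ~0.32
```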
Bayesian Networks
Definition: Bayesian networks are directed acyclic graphs (DAGs) in which the nodes represent random variables, the arcs portray causal relationships between these variables, and the strengths of these causal influences are expressed by conditional probabilities.
Bayes - resource
• Look at http://members.aol.com/johnp71/bayes.html
Bayesian Networks
[Figure: root nodes y1, y2, …, yt, each with an arc into the child node x.]
• yi: parent nodes (in this case, root nodes); x: child node; the yi cause x
• Let Y be the set of parents of x
• The influence of Y on x can be quantified by any function F(x,Y) such that
  $$\sum_x F(x,Y) = 1 \qquad 0 \le F(x,Y) \le 1$$
• For example, F(x,Y) = P(x|Y)
Bayesian Networks
[Figure: a DAG with arcs x1→x2, x1→x3, x2→x4, x3→x4, x3→x5.]
Given the dependencies declared in a Bayesian network, the expression for the joint probability can be computed as a product of local conditional probabilities. For example,
$$P(x_1, x_2, x_3, x_4, x_5) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1)\,P(x_4 \mid x_2, x_3)\,P(x_5 \mid x_3)$$
where $P(x_1)$ is the prior probability of the root node.
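A sketch of this factorization in Python, with made-up conditional probability tables for the five binary variables; summing the product over all 2^5 states recovers 1, confirming it is a proper joint distribution:

```python
from itertools import product

# Hypothetical CPTs for the example DAG.
P_x1 = 0.6                                                  # P(x1 = 1)
P_x2_given_x1 = {1: 0.7, 0: 0.2}                            # P(x2 = 1 | x1)
P_x3_given_x1 = {1: 0.4, 0: 0.5}                            # P(x3 = 1 | x1)
P_x4_given_x2x3 = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.5, (0, 0): 0.1}
P_x5_given_x3 = {1: 0.8, 0: 0.3}                            # P(x5 = 1 | x3)

def bern(p_one, value):
    """P(X = value) for a binary X with P(X = 1) = p_one."""
    return p_one if value == 1 else 1 - p_one

def joint(x1, x2, x3, x4, x5):
    """P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3): the factored joint."""
    return (bern(P_x1, x1)
            * bern(P_x2_given_x1[x1], x2)
            * bern(P_x3_given_x1[x1], x3)
            * bern(P_x4_given_x2x3[(x2, x3)], x4)
            * bern(P_x5_given_x3[x3], x5))

# Sanity check: the factored joint sums to 1 over all 2^5 states.
print(sum(joint(*s) for s in product((0, 1), repeat=5)))    # 1.0 (up to rounding)
```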
Bayesian Networks
[Figure: the same DAG as above.]
In a Bayesian network each variable x is conditionally independent of all its non-descendants, given its parents. For example:
$$P(x_4, x_5 \mid x_2, x_3) = P(x_4 \mid x_2, x_3)\,P(x_5 \mid x_3)$$
Inference Network Model
• An epistemological view of the IR problem
• Random variables are associated with documents, index terms, and queries
• The random variable associated with a document dj represents the event of observing that document
Inference Network Model
[Figure: a document node dj with arcs to its index term nodes k1, k2, …, ki, …, kt; the query nodes q (an AND node), q1 and q2 (OR nodes) are built from index term nodes and combine through OR into the information need node I.]
• Nodes: documents (dj), index terms (ki), queries (q, q1, and q2), user information need (I)
• Edges from dj to its index term nodes ki indicate that the observation of dj increases the belief in the variables ki.
Inference Network Model
[Figure: the same network as above.]
• dj has index terms k2, ki, and kt
• q has index terms k1, k2, and ki
• q1 and q2 model a Boolean formulation:
  $q_1 = (k_1 \wedge k_2) \vee k_i$; $I = q \vee q_1$
Inference Network Model
Definitions:
• $k_i$, $d_j$, and $q$ are random variables
• $\vec{k} = (k_1, k_2, \ldots, k_t)$ is a t-dimensional vector; since each $k_i \in \{0,1\}$, $\vec{k}$ has $2^t$ possible states
• $d_j \in \{0,1\}$; $q \in \{0,1\}$
• The rank of a document dj is computed as $P(q \wedge d_j)$
• q and dj are short representations for q = 1 and dj = 1
  (dj stands for a state where dj = 1 and $\forall l \neq j,\ d_l = 0$, because we observe one document at a time)
Inference Network Model
$$\begin{aligned} P(q \wedge d_j) &= \sum_{\vec{k}} P(q \wedge d_j \mid \vec{k})\,P(\vec{k}) \\ &= \sum_{\vec{k}} P(q \wedge d_j \wedge \vec{k}) \\ &= \sum_{\vec{k}} P(q \mid d_j \wedge \vec{k})\,P(d_j \wedge \vec{k}) \\ &= \sum_{\vec{k}} P(q \mid \vec{k})\,P(\vec{k} \mid d_j)\,P(d_j) \end{aligned}$$
$$P(\neg(q \wedge d_j)) = 1 - P(q \wedge d_j)$$
Inference Network Model
Since the instantiation of dj makes all index term nodes mutually independent, $P(\vec{k} \mid d_j)$ factors into a product, so
$$P(q \wedge d_j) = \sum_{\vec{k}} \Big[ P(q \mid \vec{k}) \Big( \prod_{i \mid g_i(\vec{k})=1} P(k_i \mid d_j) \Big) \Big( \prod_{i \mid g_i(\vec{k})=0} P(\neg k_i \mid d_j) \Big)\,P(d_j) \Big]$$
Remember that $g_i(\vec{k}) = 1$ if $k_i = 1$ in the vector $\vec{k}$, and 0 otherwise.
Inference Network Model
The prior probability P(dj) reflects the probability associated with the event of observing a given document dj:
• Uniform over the N documents:
  $P(d_j) = 1/N$; $P(\neg d_j) = 1 - 1/N$
• Based on the norm of the vector $\vec{d}_j$:
  $P(d_j) = 1/|\vec{d}_j|$; $P(\neg d_j) = 1 - 1/|\vec{d}_j|$
Inference Network Model
For the Boolean model:
$$P(d_j) = 1/N \qquad P(k_i \mid d_j) = \begin{cases} 1 & \text{if } g_i(\vec{d}_j) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad P(\neg k_i \mid d_j) = 1 - P(k_i \mid d_j)$$
→ only the nodes associated with the index terms of the document dj are activated
Inference Network Model
For the Boolean model:
$$P(q \mid \vec{k}) = \begin{cases} 1 & \text{if } \exists\, q_{cc} \mid (q_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{k}) = g_i(q_{cc})) \\ 0 & \text{otherwise} \end{cases} \qquad P(\neg q \mid \vec{k}) = 1 - P(q \mid \vec{k})$$
→ one of the conjunctive components of the query must be matched by the active index terms in $\vec{k}$
Inference Network Model
For a tf-idf ranking strategy:
$$P(d_j) = 1/|\vec{d}_j| \qquad P(\neg d_j) = 1 - 1/|\vec{d}_j|$$
→ the prior probability reflects the importance of document normalization
Inference Network Model
For a tf-idf ranking strategy:
$$P(k_i \mid d_j) = f_{i,j} \qquad P(\neg k_i \mid d_j) = 1 - f_{i,j}$$
→ the relevance of an index term ki is determined by its normalized term-frequency factor $f_{i,j} = \mathrm{freq}_{i,j} / \max_l \mathrm{freq}_{l,j}$
Inference Network Model
For a tf-idf ranking strategy, define a vector $\vec{k}_i$ given by
$$\vec{k}_i = \vec{k} \mid \big( (g_i(\vec{k}) = 1) \wedge (\forall j \neq i,\ g_j(\vec{k}) = 0) \big)$$
→ in the state $\vec{k}_i$ only the node ki is active and all the others are inactive
Inference Network Model
For a tf-idf ranking strategy:
$$P(q \mid \vec{k}) = \begin{cases} idf_i & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{q}) = 1 \\ 0 & \text{if } \vec{k} \neq \vec{k}_i \vee g_i(\vec{q}) = 0 \end{cases} \qquad P(\neg q \mid \vec{k}) = 1 - P(q \mid \vec{k})$$
→ we can sum up the individual contributions of each index term, weighted by its normalized idf
Inference Network Model
For a tf-idf ranking strategy: since $P(q \mid \vec{k}) = 0$ for all $\vec{k} \neq \vec{k}_i$, we can rewrite $P(q \wedge d_j)$ as
$$\begin{aligned} P(q \wedge d_j) &= \sum_{\vec{k}_i} \Big[ P(q \mid \vec{k}_i)\,P(k_i \mid d_j) \Big( \prod_{l \mid l \neq i} P(\neg k_l \mid d_j) \Big) P(d_j) \Big] \\ &= \Big( \prod_i P(\neg k_i \mid d_j) \Big) P(d_j) \sum_{\vec{k}_i} \big[ P(k_i \mid d_j)\,P(q \mid \vec{k}_i) / P(\neg k_i \mid d_j) \big] \end{aligned}$$
Inference Network Model
For a tf-idf ranking strategy, applying the previous probabilities we have
$$P(q \wedge d_j) = C_j\,\frac{1}{|\vec{d}_j|} \sum_i \Big[ f_{i,j}\; idf_i\; \frac{1}{1 - f_{i,j}} \Big]$$
• $C_j = \prod_i P(\neg k_i \mid d_j)$ varies from document to document
• the ranking is distinct from the one provided by the vector model
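As a sketch (toy corpus, hypothetical frequencies), the same score can be computed in the un-factored form $P(q \wedge d_j) = (1/|\vec{d}_j|) \sum_{i \in q} idf_i\, f_{i,j} \prod_{l \neq i} (1 - f_{l,j})$, which avoids the division by zero that the factored form suffers when $f_{i,j} = 1$ (the document's most frequent term):

```python
import math

# Toy corpus with made-up raw term frequencies; idf here is the common
# log(N/ni) variant, standing in for the slides' "normalized idf".
docs = {
    "d1": {"gold": 2, "silver": 1, "truck": 1},
    "d2": {"shipment": 1, "gold": 3, "fire": 1},
    "d3": {"shipment": 2, "truck": 1, "arrived": 1},
}
query = {"gold", "truck"}
N = len(docs)
n = {k: sum(1 for t in docs.values() if k in t) for terms in docs.values() for k in terms}

def score(d):
    freqs = docs[d]
    max_f = max(freqs.values())
    f = {k: fr / max_f for k, fr in freqs.items()}      # f_ij = freq_ij / max_l freq_lj
    norm = math.sqrt(sum(w * w for w in f.values()))    # |dj|
    s = 0.0
    for ki in query & freqs.keys():
        prod = 1.0
        for kl in freqs:
            if kl != ki:
                prod *= 1 - f[kl]                       # P(~kl | dj)
        s += math.log(N / n[ki]) * f[ki] * prod         # idf_i * f_ij * product
    return s / norm

ranking = sorted(docs, key=score, reverse=True)
print(ranking)
```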
Inference Network Model
Combining evidential sources: let $I = q \vee q_1$. Then
$$\begin{aligned} P(I \wedge d_j) &= \sum_{\vec{k}} P(I \mid \vec{k})\,P(\vec{k} \mid d_j)\,P(d_j) \\ &= \sum_{\vec{k}} \big[ 1 - P(\neg q \mid \vec{k})\,P(\neg q_1 \mid \vec{k}) \big]\,P(\vec{k} \mid d_j)\,P(d_j) \end{aligned}$$
→ this might yield a retrieval performance which surpasses that of the query nodes in isolation (Turtle & Croft)
Belief Network Model
• Like the Inference Network Model:
  – an epistemological view of the IR problem
  – random variables associated with documents, index terms, and queries
• Contrary to the Inference Network Model:
  – a clearly defined sample space
  – a set-theoretic view
  – a different network topology
Belief Network Model
The probability space. Define:
• $K = \{k_1, k_2, \ldots, k_t\}$: the sample space (a concept space)
• $u \subseteq K$: a subset of K (a concept)
• $k_i$: an index term (an elementary concept)
• $\vec{k} = (k_1, k_2, \ldots, k_t)$: a vector associated with each u such that $g_i(\vec{k}) = 1 \iff k_i \in u$
• $k_i$: a binary random variable associated with the index term ki, where $k_i = 1 \iff g_i(\vec{k}) = 1 \iff k_i \in u$
Belief Network Model
A set-theoretic view. Define:
• a document dj and a query q as concepts in K
• a generic concept c in K
• a probability distribution P over K, as
  $$P(c) = \sum_u P(c \mid u)\,P(u) \qquad P(u) = (1/2)^t$$
• P(c) is the degree of coverage of the space K by c
Belief Network Model
Network topology:
[Figure: the query node q on the query side and the document nodes d1, …, dj, …, dn on the document side, linked through the index term nodes k1, k2, …, ki, …, kt.]
Belief Network Model
Assumption: $P(d_j \mid q)$ is adopted as the rank of the document dj with respect to the query q. It reflects the degree of coverage provided to the concept dj by the concept q.
Belief Network Model
The rank of dj:
$$\begin{aligned} P(d_j \mid q) &= P(d_j \wedge q) / P(q) \\ &\sim P(d_j \wedge q) \\ &\sim \sum_u P(d_j \wedge q \mid u)\,P(u) \\ &\sim \sum_u P(d_j \mid u)\,P(q \mid u)\,P(u) \\ &\sim \sum_{\vec{k}} P(d_j \mid \vec{k})\,P(q \mid \vec{k})\,P(\vec{k}) \end{aligned}$$
Belief Network Model
For the vector model, define a vector $\vec{k}_i$ given by
$$\vec{k}_i = \vec{k} \mid \big( (g_i(\vec{k}) = 1) \wedge (\forall j \neq i,\ g_j(\vec{k}) = 0) \big)$$
→ in the state $\vec{k}_i$ only the node ki is active and all the others are inactive
Belief Network Model
For the vector model, define
$$P(q \mid \vec{k}) = \begin{cases} w_{i,q} / |\vec{q}| & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{q}) = 1 \\ 0 & \text{if } \vec{k} \neq \vec{k}_i \vee g_i(\vec{q}) = 0 \end{cases} \qquad P(\neg q \mid \vec{k}) = 1 - P(q \mid \vec{k})$$
→ $(w_{i,q} / |\vec{q}|)$ is a normalized version of the weight of the index term ki in the query q
Belief Network Model
For the vector model, define
$$P(d_j \mid \vec{k}) = \begin{cases} w_{i,j} / |\vec{d}_j| & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{d}_j) = 1 \\ 0 & \text{if } \vec{k} \neq \vec{k}_i \vee g_i(\vec{d}_j) = 0 \end{cases} \qquad P(\neg d_j \mid \vec{k}) = 1 - P(d_j \mid \vec{k})$$
→ $(w_{i,j} / |\vec{d}_j|)$ is a normalized version of the weight of the index term ki in the document dj
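Under these instantiations and the uniform $P(\vec{k}) = (1/2)^t$, the rank reduces (up to a constant) to $\sum_i (w_{i,q}/|\vec{q}|)(w_{i,j}/|\vec{d}_j|)$, i.e., the cosine ranking of the vector model; a minimal sketch with hypothetical weights:

```python
import math

# With P(k) uniform, the belief-network rank is proportional to the cosine
# between the query and document weight vectors.
def rank(doc_weights, query_weights):
    """doc_weights / query_weights: dicts term -> weight (hypothetical tf-idf)."""
    dnorm = math.sqrt(sum(w * w for w in doc_weights.values()))
    qnorm = math.sqrt(sum(w * w for w in query_weights.values()))
    return sum((query_weights[k] / qnorm) * (doc_weights[k] / dnorm)
               for k in query_weights.keys() & doc_weights.keys())

dj = {"gold": 0.8, "truck": 0.5}   # made-up document weights w_ij
q = {"gold": 1.0, "truck": 0.4}    # made-up query weights w_iq
print(rank(dj, q))                 # the cosine similarity of q and dj
```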
Bayesian Network Models
Comparison:
• the Inference Network Model came first and is the better known of the two
• the Belief Network adopts a set-theoretic view
• the Belief Network adopts a clearly defined sample space
• the Belief Network provides a separation between the query and document portions of the network
• the Belief Network is able to reproduce any ranking produced by the Inference Network, while the converse is not true (for example, the ranking of the standard vector model)
Bayesian Network Models
Computational costs:
• the Inference Network Model instantiates one document node at a time, so it is linear in the number of documents
• the Belief Network considers only the states that activate each query term
• neither network imposes additional costs, because the networks do not include cycles.
Bayesian Network Models
Impact: the major strength of these models is the neat combination of distinct evidential sources to support the rank of a given document.