Download Introduction to Information Retrieval

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Introduction to Information Retrieval
Introduction to
Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 6: Scoring, Term Weighting, The Vector
Space Model
1
Introduction to Information Retrieval
Overview
❶
Why ranked retrieval?
❷
Weighted zone scoring
❸
Term frequency
❹
tf-idf weighting
❺
The vector space model
2
Introduction to Information Retrieval
Outline
❶
Why ranked retrieval?
❷
Weighted zone scoring
❸
Term frequency
❹
tf-idf weighting
❺
The vector space model
3
Introduction to Information Retrieval
Problem with Boolean search
 Thus far, our queries have all been Boolean.
• Documents either match or don’t.
 Good for expert users with precise understanding of their
needs and of the collection.
 Not good for the majority of users
• Most users are not capable of writing Boolean (or they are, but
they think it’s too much work.)
• Most users don’t want to wade through 1000s of results.
4
Introduction to Information Retrieval
Problem with Boolean search: Feast or famine
 Boolean queries often result in either too few or too many
(1000s) results.
 Feast
• Query : “standard user dlink 650”  200,000 hits
 Famine
• Query : “standard user dlink 650 no card found”  2 hits
 In Boolean retrieval, it takes a lot of skill to come up with a
query that produces a manageable number of hits.
 AND gives too few, and OR gives too many.
5
Introduction to Information Retrieval
Feast or famine: No problem in ranked retrieval




With ranking, large result sets are not an issue.
Just show the top 10 results
Doesn’t overwhelm the user
Premise: the ranking algorithm works: More relevant results
are ranked higher than less relevant results.
6
Introduction to Information Retrieval
Query-document matching scores
 How do we compute the score of a query-document pair?
 Let’s start with a one-term query.
 If the query term does not occur in the document: score
should be 0.
 The more frequent the query term in the document, the
higher the score (should be)
 We will look at a number of alternatives for doing this.
7
Introduction to Information Retrieval
Outline
❶
Why ranked retrieval?
❷
Weighted zone scoring
❸
Term frequency
❹
tf-idf weighting
❺
The vector space model
8
Introduction to Information Retrieval
Parametric and zone indexes
 Metadata- we mean specific forms of data about a document,
such as its author(s), title and data of publication.
 The metadata would generally include fields such as the data
of creation and the format of the document, as well the
author and possibly the title of the document.
 Zones are similar to fields, except the contents of a zone can
be arbitrary free text.
9
Introduction to Information Retrieval
Parametric index
 Parametric search
10
Introduction to Information Retrieval
Zone index
 Basic zone index
 Zone index in which the zone is encoded
steven
11
Introduction to Information Retrieval
Weighted zone scoring
 Weighted zone scoring is sometimes referred to also as
ranked Boolean retrieval
 Consider a set of documents contributes a Boolean value.
12
Introduction to Information Retrieval
Weighted zone scoring-Example
 Consider the query steven in a collection in which each
document has three zones: abstract, title and author. The
Boolean score function for a zone takes on the value 1 if the
query term steven is present in the zone, and 0 otherwise.
Weighted zone scoring in such a collection would require
three weights g1,g2 and g3, respectively corresponding to
abstract, title and author zones.
Suppose we set g1=0.2, g2=0.3, and g3=0.5
steven
13
Introduction to Information Retrieval
Algorithm for computing the weighted zone
score from two postings lists
14
Introduction to Information Retrieval
Learning weights
 Given a query q and a document d, we use the given Boolean
match function to compute Boolean variables ST(d,q) and
SB(d,q).
 Train examples, each of which is a triple of the form
15
Introduction to Information Retrieval
Learning weights
16
Introduction to Information Retrieval
The optimal weight g
17
Introduction to Information Retrieval
The optimal weight g
 Total error=0.75
g=
=0.25
18
Introduction to Information Retrieval
Outline
❶
Why ranked retrieval?
❷
Weighted zone scoring
❸
Term frequency
❹
tf-idf weighting
❺
The vector space model
19
Introduction to Information Retrieval
Bag of words model
 The exact ordering of the terms in a document is ignored
but the number of occurrence of each term is material.
 We only retain information on the number of occurrences
of each term.
 “John is quicker than Mary” and “Mary is quicker than
John” are represented the same way.
 This is called a bag of words model.
 For now: bag of words model
20
Introduction to Information Retrieval
Term frequency tf
 The term frequency tft,d of term t in document d is defined
as the number of occurrences of term t in document d.
 We want to use tf when computing query-document match
scores.
 But how?
 Raw term frequency is not what we want because:
• A document with tf = 10 occurrences of the term is
more relevant than a document with tf = 1 occurrence
of the term.
• But not 10 times more relevant.
 Relevance does not increase proportionally with term
frequency.
21
Introduction to Information Retrieval
Log-frequency weighting
 The log frequency weight of term t in d is
• tft,d → wft,d :
0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
 Score for a document-query pair: sum over terms t in both
q and d:
Score 
(1  log tf t ,d )

tqd
 The score is 0 if none of the query terms is present in the
document.
22
Introduction to Information Retrieval
Outline
❶
Why ranked retrieval?
❷
Weighted zone scoring
❸
Term frequency
❹
tf-idf weighting
❺
The vector space model
23
Introduction to Information Retrieval
Collection frequency vs. Document frequency
 Collection frequency(cft)
• The total number of occurrences of a term t in the
collection.
 Document frequency(dft)
• The number of documents in the collection that contain
a term t.
24
Introduction to Information Retrieval
Desired weight for rare terms
 Rare terms are more informative than frequent terms.
 Consider a term in the query that is rare in the collection
(e.g., ARACHNOCENTRIC).
 A document containing this term is very likely to be
relevant.
 → We want high weights for rare terms like
ARACHNOCENTRIC.
25
Introduction to Information Retrieval
Desired weight for frequent terms
 Frequent terms are less informative than rare terms.
 Consider a term in the query that is frequent in the
collection (e.g., GOOD, INCREASE, LINE).
 A document containing this term is more likely to be
relevant than a document that doesn’t . . .
 . . . but words like GOOD, INCREASE and LINE are not sure
indicators of relevance.
 → For frequent terms like GOOD, INCREASE and LINE, we
want positive weights . . .
 . . . but lower weights than for rare terms.
26
Introduction to Information Retrieval
Inverse Document Frequency
 We want high weights for rare terms like ARACHNOCENTRIC.
 We want low (positive) weights for frequent words like
GOOD, INCREASE and LINE.
 We will use document frequency to factor this into
computing the matching score.
27
Introduction to Information Retrieval
idf weight
 dft is an inverse measure of the informativeness of term t.
 We define the idft weight of term t as follows:
(N is the number of documents in the collection.)
 idft is a measure of the informativeness of the term.
28
Introduction to Information Retrieval
Examples for idf
 The Reuters collection of 806,791 documents
 Compute idft using the formula:
29
Introduction to Information Retrieval
tf-idf weighting
 The tf-idf weighting scheme assigns to term t a weight in
document d given by
 Highest when t occurs many times within a small number
of documents.
 Lower when the term occurs fewer times in a document, or
occurs in many documents.
 Lowest when the term occurs in virtually all documents.
30
Introduction to Information Retrieval
Examples for tf-idf
31
Introduction to Information Retrieval
Variant tf-idf functions
 wf-idft,d weighting
32
Introduction to Information Retrieval
Outline
❶
Why ranked retrieval?
❷
Weighted zone scoring
❸
Term frequency
❹
tf-idf weighting
❺
The vector space model
33
Introduction to Information Retrieval
Binary incidence matrix
Anthony Julius
and
Caesar
Cleopatra
ANTHONY
BRUTUS
CAESAR
CALPURNIA
CLEOPATRA
MERCY
WORSER
...
1
1
1
0
1
1
1
The
Hamlet
Tempest
1
1
1
1
0
0
0
0
0
0
0
0
1
1
Othello
0
1
1
0
0
1
1
Macbeth
...
0
0
1
0
0
1
1
1
0
1
0
0
1
0
Each document is represented as a binary vector ∈ {0, 1}|V|.
34
Introduction to Information Retrieval
Count matrix
Anthony Julius
and
Caesar
Cleopatra
ANTHONY
BRUTUS
CAESAR
CALPURNIA
CLEOPATRA
MERCY
WORSER
...
157
4
232
0
57
2
2
73
157
227
10
0
0
0
The
Hamlet
Tempest
0
0
0
0
0
3
1
Othello
0
2
2
0
0
8
1
Macbeth
...
0
0
1
0
0
5
1
1
0
0
0
0
8
5
Each document is now represented as a count vector ∈ N|V|.
35
Introduction to Information Retrieval
Binary → count → weight matrix
Anthony Julius
and
Caesar
Cleopatra
ANTHONY
BRUTUS
CAESAR
CALPURNIA
CLEOPATRA
MERCY
WORSER
...
5.25
1.21
8.59
0.0
2.85
1.51
1.37
3.18
6.10
2.54
1.54
0.0
0.0
0.0
The
Hamlet
Tempest
0.0
0.0
0.0
0.0
0.0
1.90
0.11
0.0
1.0
1.51
0.0
0.0
0.12
4.15
Othello
0.0
0.0
0.25
0.0
0.0
5.25
0.25
Macbeth
...
0.35
0.0
0.0
0.0
0.0
0.88
1.95
Each document is now represented as a real-valued vector of
wf-idf weights ∈ R|V|.
36
Introduction to Information Retrieval
Documents as vectors
 Each document is now represented as a real-valued vector
of tf-idf weights ∈ R|V|.
 So we have a |V|-dimensional real-valued vector space.
 Terms are axes of the space.
 Documents are points or vectors in this space.
 Very high-dimensional: tens of millions of dimensions when
you apply this to web search engines
 Each vector is very sparse - most entries are zero.
37
Introduction to Information Retrieval
Queries as vectors
 Key idea 1: do the same for queries: represent them as
vectors in the high-dimensional space.
 Key idea 2: Rank documents according to their proximity to
the query.
 proximity = similarity.
 proximity ≈ negative distance.
 Rank relevant documents higher than non-relevant
documents.
38
Introduction to Information Retrieval
How do we formalize vector space similarity?




First cut: distance between two points
( = distance between the end points of the two vectors)
Euclidean distance?
Euclidean distance is a bad idea because Euclidean distance
is large for vectors of different lengths.
39
Introduction to Information Retrieval
Use angle instead of distance
 Rank documents according to angle with query
 Thought experiment: take a document d and append it to
itself. Call this document d′. d′ is twice as long as d.
 d and d′ have the same content.
 The angle between the two documents is 0, corresponding
to maximal similarity even though the Euclidean distance
between the two documents can be quite large.
 The following two notions are equivalent.
• Rank documents according to the angle between query and
document in decreasing order
• Rank documents according to cosine(query,document) in
increasing order
40
Introduction to Information Retrieval
Length normalization
 How do we compute the cosine?
 A vector can be (length-) normalized by dividing each of its
components by its length – here we use the L2 norm:
 This maps vectors onto the unit sphere . . .
 . . . since after normalization:
 As a result, longer documents and shorter documents have
weights of the same order of magnitude.
 Effect on the two documents d and d′ (d appended to itself)
from earlier slide: they have identical vectors after lengthnormalization.
41
Introduction to Information Retrieval
Cosine similarity between query and
document




qi is the tf-idf weight of term i in the query.
di is the tf-idf weight of term i in the document.
| | and | | are the lengths of and
This is the cosine similarity of and . . . . . . or,
equivalently, the cosine of the angle between and
42
Introduction to Information Retrieval
Cosine for normalized vectors
 For normalized vectors, the cosine is equivalent to the dot
product or scalar product.
 (if
and
are length-normalized).
43
Introduction to Information Retrieval
Cosine similarity illustrated
44
Introduction to Information Retrieval
Cosine: Example1
v(Doc1)•v(Doc2)=0.897*0.076+0.126*0.787=0.167
v(Doc1)•v(Doc3)=0.696
v(Doc2)•v(Doc3)=0.478
45
Introduction to Information Retrieval
Cosine: Example2(N=1,000,000)
v(q,Doc1)=0*2.3+0.34*0+0.52*2+0.78*6=5.72
Length[Doc1]=6.73, scores[Doc1]=0.85
v(q,Doc2)=0*4.6+0.34*1.3+0.52*4.0+0.78*0=2.52
Length[Doc2]=6.23, scores[Doc2]=0.40
v(q,Doc3)=0*2.3+0.34*2.6+0.52*10.0+0.78*6.0=10.76
Length[Doc3]=12.17, scores[Doc3]=0.88
Ranking=Doc3, Doc1, Doc2.
46
Introduction to Information Retrieval
Computing the cosine score
47
Introduction to Information Retrieval
Components of tf-idf weighting
48