Chapter 2
Information Retrieval
Chapter 2 in the textbook
Sections: 2.1, 2.2 (2.2.1, 2.2.2), 2.3 (2.3.1, 2.3.2, 2.3.3), 2.4 (2.4.1, 2.4.2)
Modern Information Retrieval
• Document representation
  o Using keywords
  o Relative weight of keywords
• Query representation
  o Keywords
  o Relative importance of keywords
• Retrieval model
  o Similarity between document and query
• Rank the documents
• Performance evaluation of the retrieval process
Document Representation
Transforming a text document to a weighted list of keywords
Stopwords
Figure 2.2 A partial list of stopwords
Sample Document
Data Mining has emerged as one of the most exciting and dynamic
fields in computing science. The driving force for data mining is
the presence of petabyte-scale online archives that potentially
contain valuable bits of information hidden in them. Commercial
enterprises have been quick to recognize the value of this
concept; consequently, within the span of a few years, the
software market itself for data mining is expected to be in
excess of $10 billion. Data mining refers to a family of
techniques used to detect interesting nuggets of
relationships/knowledge in data. While the theoretical
underpinnings of the field have been around for quite some time
(in the form of pattern recognition, statistics, data analysis
and machine learning), the practice and use of these techniques
have been largely ad-hoc. With the availability of large
databases to store, manage and assimilate data, the new thrust of
data mining lies at the intersection of database systems,
artificial intelligence and algorithms that efficiently analyze
data. The distributed nature of several databases, their size and
the high complexity of many techniques present interesting
computational challenges.
List of words in d1 after deleting stopwords
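The stopword-deletion step can be sketched as follows. This is a minimal illustration, not the textbook's procedure: the `STOPWORDS` set is a small hypothetical subset (the actual list in Figure 2.2 is not reproduced in this transcript), and `remove_stopwords` is an illustrative name.

```python
import re

# Hypothetical stopword subset; the real list in Figure 2.2 is longer.
STOPWORDS = {"a", "and", "as", "for", "has", "in", "is", "most",
             "of", "one", "that", "the", "them"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into alphabetic words, drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stopwords]

# First sentence of the sample document.
sentence = ("Data Mining has emerged as one of the most exciting "
            "and dynamic fields in computing science.")
print(remove_stopwords(sentence))
# ['data', 'mining', 'emerged', 'exciting', 'dynamic', 'fields', 'computing', 'science']
```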
Stemming
A given word may occur in a variety of syntactic forms
• plurals
• past tense
• gerund forms (a noun derived from a verb)
Example
The word connect may appear as
• connector, connection, connections, connected, connecting, connects, preconnection, and postconnection.
A stem is what is left of a word after its affixes (prefixes and suffixes) are removed
Suffixes
• connector, connection, connections, connected, connecting, connects
Prefixes
• preconnection and postconnection
Stem
• connect
Porter’s Algorithm
• Letters A, E, I, O, and U are vowels
• A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y
• The letter Y is a vowel if it is preceded by a consonant, otherwise it is a consonant
  o For example, Y in synopsis is a vowel, while in toy it is a consonant
• A consonant in the algorithm description is denoted by c, and a vowel by v
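The vowel and consonant rules above can be sketched as a small classifier that maps a word to its c/v pattern. The function names `is_consonant` and `cv_pattern` are illustrative, not part of Porter's description.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Return True if word[i] is a consonant under the rules above."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    """Map each letter of the word to 'c' (consonant) or 'v' (vowel)."""
    word = word.lower()
    return "".join("c" if is_consonant(word, i) else "v"
                   for i in range(len(word)))

print(cv_pattern("synopsis"))  # cvcvccvc  (Y is a vowel)
print(cv_pattern("toy"))       # cvc       (Y is a consonant)
```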
• m is the measure of vc repetition: a word has the form [C](VC)^m[V], where C is a run of consonants and V is a run of vowels
  o m=0: TR, EE, TREE, Y, BY
  o m=1: TROUBLE, OATS, TREES, IVY
  o m=2: TROUBLES, PRIVATE, OATEN, ORRERY
• *S – the stem ends with S (similarly for other letters)
• *v* – the stem contains a vowel
• *d – the stem ends with a double consonant (e.g., -TT)
• *o – the stem ends cvc, where the second c is not W, X, or Y (e.g., -WIL)
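The measure m and the stem conditions can be sketched as below. This is a minimal illustration under the vowel and consonant rules stated earlier; all function names are illustrative choices, not Porter's.

```python
import re

VOWELS = set("aeiou")

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC repetitions m in the form [C](VC)^m[V]."""
    stem = stem.lower()
    pattern = "".join("c" if is_consonant(stem, i) else "v"
                      for i in range(len(stem)))
    # Collapse each run of vowels to V and each run of consonants to C.
    collapsed = re.sub(r"c+", "C", re.sub(r"v+", "V", pattern))
    return collapsed.count("VC")

def contains_vowel(stem):           # condition *v*
    stem = stem.lower()
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):    # condition *d
    stem = stem.lower()
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):                 # condition *o
    stem = stem.lower()
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

for word in ["TREE", "TROUBLE", "ORRERY"]:
    print(word, measure(word))  # TREE 0, TROUBLE 1, ORRERY 2
```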
Porter’s algorithm - Step 1
Step 1: plurals and past participles
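The rule table for this slide is not part of the transcript. As a minimal sketch, the plural-handling part of Step 1 (Step 1a in Porter's published description) can be written as follows; the function name `step_1a` is illustrative, and the past-participle rules of Step 1 are not covered here.

```python
def step_1a(word):
    """Apply the plural rules of Porter's Step 1a; the first matching rule wins."""
    word = word.lower()
    if word.endswith("sses"):   # SSES -> SS
        return word[:-2]
    if word.endswith("ies"):    # IES -> I
        return word[:-2]
    if word.endswith("ss"):     # SS -> SS (unchanged)
        return word
    if word.endswith("s"):      # S -> (removed)
        return word[:-1]
    return word

print(step_1a("caresses"))         # caress
print(step_1a("ponies"))           # poni
print(step_1a("generalizations"))  # generalization
```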
Porter’s algorithm - Step 2
Steps 2–4: straightforward stripping of suffixes
Porter’s algorithm - Step 3
Porter’s algorithm - Step 4
Porter’s algorithm - Step 5
Step 5: tidying up
Example
• GENERALIZATIONS
  o Step 1: GENERALIZATION
  o Step 2: GENERALIZE
  o Step 3: GENERAL
  o Step 4: GENER
• OSCILLATORS
  o Step 1: OSCILLATOR
  o Step 2: OSCILLATE
  o Step 4: OSCILL
  o Step 5: OSCIL
Porter’s algorithm
Suffix stripping of a vocabulary of 10,000 words
(http://www.tartarus.org/~martin/)
Document Representation
Term-Document Matrix
• Term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent various documents
• Columns correspond to various index terms
• Values in the matrix can be either the frequency or weight of the index term (identified by the column) in the document (identified by the row).
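A term-document matrix of raw frequencies can be sketched as below. The three tiny documents are hypothetical keyword lists (already stopword-filtered), and `build_tdm` is an illustrative name.

```python
from collections import Counter

def build_tdm(docs):
    """Return (terms, matrix), where matrix[i][j] is the frequency
    of terms[j] in docs[i]. Rows are documents, columns are index terms."""
    counts = [Counter(doc) for doc in docs]
    terms = sorted(set().union(*docs))
    matrix = [[count[term] for term in terms] for count in counts]
    return terms, matrix

# Hypothetical keyword lists, one per document.
docs = [
    ["data", "mining", "data"],
    ["database", "systems"],
    ["mining", "algorithms", "data"],
]
terms, tdm = build_tdm(docs)
print(terms)  # ['algorithms', 'data', 'database', 'mining', 'systems']
for row in tdm:
    print(row)
# [0, 2, 0, 1, 0]
# [0, 0, 1, 0, 1]
# [1, 1, 0, 1, 0]
```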
Term-Document matrix
Sparse Matrices - Triples
Sparse Matrices - Pairs
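The figures for these two slides are not in the transcript. One plausible reading, offered here as an assumption: the "triples" form stores each nonzero entry as (document, term, value), and the "pairs" form stores, for each document, a list of (term, value). A minimal sketch under that assumption, with a hypothetical matrix:

```python
def to_triples(matrix):
    """One (row, column, value) triple per nonzero entry."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if v != 0]

def to_pairs(matrix):
    """One list per document, holding (column, value) pairs for nonzero entries."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in matrix]

# Hypothetical 3-document, 4-term matrix.
tdm = [[0, 2, 0, 1],
       [0, 0, 1, 0],
       [1, 1, 0, 0]]
print(to_triples(tdm))  # [(0, 1, 2), (0, 3, 1), (1, 2, 1), (2, 0, 1), (2, 1, 1)]
print(to_pairs(tdm))    # [[(1, 2), (3, 1)], [(2, 1)], [(0, 1), (1, 1)]]
```

Both forms store only the nonzero entries, which is where the saving comes from when most terms are absent from most documents.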
Normalization
• Raw frequency values are not useful for a retrieval model
• Prefer normalized weights, usually between 0 and 1, for each term in a document
• Dividing all the keyword frequencies by the largest frequency in the document is a simple method of normalization:
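The normalization described above, dividing each frequency by the largest frequency in the same document, can be sketched as follows. The name `normalize_row` and the sample frequencies are illustrative.

```python
def normalize_row(freqs):
    """Divide every frequency in one document row by the row's largest frequency."""
    largest = max(freqs, default=0)
    if largest == 0:
        # An empty or all-zero document has no terms to weight.
        return [0.0 for _ in freqs]
    return [f / largest for f in freqs]

print(normalize_row([4, 2, 0, 1]))  # [1.0, 0.5, 0.0, 0.25]
```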
Normalized Term-Document Matrix
Vector Representation of document d1
(word, frequency, normalized frequency)
Retrieval models
Retrieval models match a query with documents to:
• separate documents into relevant and non-relevant classes
• rank the documents according to their relevance
Retrieval models
• Boolean model
• Vector space model (VSM)
• Probabilistic models
Boolean Retrieval Model
• One of the simplest and most efficient retrieval mechanisms
• Based on set theory and Boolean algebra
• Conventional numeric representations of false as 0 and true as 1
• Boolean model is interested only in the presence or absence of a term in a document
• In the term-document matrix, replace all the nonzero values with 1
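Replacing every nonzero value with 1 can be sketched in one line per matrix. The function name `to_boolean` and the sample matrix are illustrative.

```python
def to_boolean(matrix):
    """Replace every nonzero entry with 1 and keep zeros as 0."""
    return [[1 if value != 0 else 0 for value in row] for row in matrix]

# Hypothetical weighted matrix: rows are documents, columns are terms.
print(to_boolean([[0, 2, 0.5], [3, 0, 0]]))  # [[0, 1, 1], [1, 0, 0]]
```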
Boolean Term-document Matrix
Example
Document sets
• DocSet(K0) = {D1, D3, D5}
• DocSet(K4) = {D2, D3, D4, D6}
Queries
• K0 and K4
• K0 or K4
• K0 or (not K3 and K5)
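Since the Boolean model is based on set theory, the and/or queries over the two document sets above can be sketched with Python sets: "and" is intersection and "or" is union. The dictionary name `doc_set` is illustrative; the document sets for K3 and K5 are not given in the transcript, so the third query is not evaluated here.

```python
# Document sets as listed on the slide.
doc_set = {
    "K0": {"D1", "D3", "D5"},
    "K4": {"D2", "D3", "D4", "D6"},
}

k0_and_k4 = doc_set["K0"] & doc_set["K4"]   # intersection
k0_or_k4 = doc_set["K0"] | doc_set["K4"]    # union

print(sorted(k0_and_k4))  # ['D3']
print(sorted(k0_or_k4))
```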
Boolean Query
• User Boolean queries are usually simple Boolean expressions
• A Boolean query can be represented in a “disjunctive normal form” (DNF)
  o disjunction corresponds to or
  o conjunction refers to and
  o DNF consists of a disjunction of conjunctive Boolean expressions
DNF form
• K0 or (not K3 and K5) is in DNF
• DNF query processing can be very efficient
  o If any one of the conjunctive expressions is true, the entire DNF will be true
  o Short-circuit the expression evaluation
  o Stop matching the expression with a document as soon as a conjunctive expression matches the document; label the document as relevant to the query
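The short-circuit evaluation described above can be sketched with Python's `any` and `all`, which stop as soon as the result is decided. The encoding of a DNF query as a list of conjunctions of `(term, wanted)` literals, the name `matches_dnf`, and the three documents are all hypothetical choices for illustration.

```python
def matches_dnf(doc_terms, dnf):
    """Return True as soon as one conjunction matches the document.

    dnf is a list of conjunctions; each conjunction is a list of
    (term, wanted) literals, with wanted=False meaning 'not term'.
    """
    return any(
        all((term in doc_terms) == wanted for term, wanted in conjunction)
        for conjunction in dnf
    )

# K0 or (not K3 and K5)
query = [[("K0", True)], [("K3", False), ("K5", True)]]

# Hypothetical documents, each given as the set of terms it contains.
docs = {"Dx": {"K0", "K3"}, "Dy": {"K3", "K5"}, "Dz": {"K5"}}
relevant = [name for name, terms in docs.items() if matches_dnf(terms, query)]
print(relevant)  # ['Dx', 'Dz']
```

Because the arguments to `any` and `all` are generator expressions, the remaining conjunctions are never evaluated once one of them matches.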
Boolean Model
Advantages
• Simplicity and efficiency of implementation
• Binary values can be stored using bits
  o reduced storage requirements
  o retrieval using bitwise operations is efficient
• Boolean retrieval was adopted by many commercial bibliographic systems
• Boolean queries are akin to database queries
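Storing one bit per document and answering a query with a bitwise operation can be sketched with Python integers used as bit vectors. The helper names `to_bits` and `from_bits` are illustrative, and the six-document collection D1 to D6 is an assumption based on the earlier example.

```python
DOCS = ["D1", "D2", "D3", "D4", "D5", "D6"]  # assumed collection

def to_bits(doc_set):
    """Pack a set of documents into an integer, one bit per document."""
    return sum(1 << i for i, doc in enumerate(DOCS) if doc in doc_set)

def from_bits(bits):
    """Unpack an integer bit vector back into a list of document names."""
    return [doc for i, doc in enumerate(DOCS) if (bits >> i) & 1]

k0 = to_bits({"D1", "D3", "D5"})
k4 = to_bits({"D2", "D3", "D4", "D6"})

print(from_bits(k0 & k4))  # ['D3']  (K0 and K4 as a single bitwise AND)
```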
Boolean Model
Disadvantages
• A document is either relevant or non-relevant to the query
• It is not possible to assign a degree of relevance
• Complicated Boolean queries are difficult for users
• Boolean queries retrieve too few or too many documents.
  o K0 and K4 retrieved only 1 out of 6 documents
  o K0 or K4 retrieved 5 out of a possible 6 documents
Vector Space Model (VSM)
• Treats both the documents and queries as vectors
• A weight based on the frequency in the document:
Graphical representation of the VSM model
Computing the similarity
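The similarity formula on this slide is not in the transcript. A common choice in the vector space model is the cosine of the angle between the query vector and the document vector; the sketch below assumes that measure, which may differ from the one used in the textbook. The query vector is the one used in these slides; the document vectors `Da`, `Db`, `Dc` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (name, score) pairs sorted by decreasing similarity to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

query = [0, 0.2, 0.6, 0, 0.2, 0.3, 0]
docs = {  # hypothetical document vectors
    "Da": [0, 0.2, 0.6, 0, 0.2, 0.3, 0],
    "Db": [1, 0, 0, 1, 0, 0, 1],
    "Dc": [0, 1, 1, 0, 0, 0, 0],
}
for name, score in rank(query, docs):
    print(name, round(score, 4))
```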
Relevance Values and Ranking
Ranking
D0 (0.7774)
D6 (0.4953)
D2 (0.3123)
D1 (0.2590)
D5 (0.2122)
D4 (0.1727)
D3 (0.1084)
Variations of VSM
• Variations of the normalized frequency
• Inverse document frequency (idf)
  o N = no. of documents
  o nj = no. of documents containing jth term
• Modified weights:
Inverse Document Frequencies for Collection (normalized)
\[ idf_0 = idf_1 = idf_2 = idf_3 = \log\frac{7}{3} = 0.368 \]
\[ idf_4 = idf_5 = idf_6 = \log\frac{7}{4} = 0.243 \]
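The two values above can be reproduced with idf_j = log(N / nj). The base-10 logarithm is an inference from the numbers on the slide (log10(7/3) rounds to 0.368), not something the transcript states; the function name `idf` is illustrative.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency log10(N / nj) for a term found in doc_freq documents."""
    return math.log10(n_docs / doc_freq)

print(round(idf(7, 3), 3))  # 0.368
print(round(idf(7, 4), 3))  # 0.243
```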
TDM using idf
q = (0, 0.2, 0.6, 0, 0.2, 0.3, 0)
Ranking
D0 (0.7867)
D6 (0.4953)
D2 (0.3361)
D1 (0.2590)
D5 (0.2215)
D4 (0.1208)
D3 (0.0969)
VSM vs. Boolean
• Queries are easier to express: users can attach relative weights to terms
• A descriptive query can be transformed to a query vector similar to documents
• Matching between a query and a document is not exact: each document is allocated a degree of similarity
• Documents are ranked based on their similarity scores instead of relevant/non-relevant classes
• Users can go through the ranked list until their information needs are met.
Evaluation of Retrieval Performance
Evaluation should include:
• Functionality
• Response time
• Storage requirement
• Accuracy
Accuracy Testing
Early days:
• Batch testing
• Document collection such as cacm.all
• Query collection such as query.text
Present day: interactive tests are used
• Difficult to conduct and time consuming
• Batch testing still important
Precision and Recall
Precision
How many of the retrieved documents are relevant?
Recall
How many of the relevant documents are retrieved?







Our earlier example illustrating the VSM
o Documents from Fig. 2.15
o Query q = (0, 0.2, 0.6, 0, 0.2, 0.3, 0)
Ranking
1. D0*  2. D6  3. D2*  4. D1  5. D5*  6. D4  7. D3*
Semantic analysis: documents marked with an asterisk are relevant
Retrieved the three top-ranked documents
Relevant documents: R = {D0, D2, D5, D3}
Retrieved documents: A = {D0, D6, D2}
\[ R \cap A = \{D0, D2\} \]
\[ \text{precision} = \frac{|R \cap A|}{|A|} = \frac{|\{D0, D2\}|}{|\{D0, D6, D2\}|} = \frac{2}{3} = 0.67 \]
\[ \text{recall} = \frac{|R \cap A|}{|R|} = \frac{|\{D0, D2\}|}{|\{D0, D2, D5, D3\}|} = \frac{2}{4} = 0.5 \]
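The precision and recall calculation above can be checked with a short sketch; the function name `precision_recall` is illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set A and a relevant set R."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # |R ∩ A|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

R = {"D0", "D2", "D5", "D3"}   # relevant documents
A = ["D0", "D6", "D2"]         # three top-ranked documents
precision, recall = precision_recall(A, R)
print(round(precision, 2), round(recall, 2))  # 0.67 0.5
```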
F-measure
\[ F = \frac{\text{precision} \times \text{recall}}{(\text{precision} + \text{recall})/2} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]
\[ F = \frac{2 \times 0.67 \times 0.5}{0.67 + 0.5} = \frac{0.67}{1.17} = 0.57 \]
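The F-measure for the running example can be reproduced as follows; the function name `f_measure` is illustrative, and the zero guard is an added convention for the case where both inputs are 0.

```python
def f_measure(precision, recall):
    """F = 2 * precision * recall / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(2 / 3, 0.5), 2))  # 0.57
```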
Average Precision
Retrieving three documents was an arbitrary choice
Rank retrieved   Precision   Recall
1                1.00        0.25
2                0.50        0.25
3                0.67        0.50
4                0.50        0.50
5                0.60        0.75
6                0.50        0.75
7                0.57        1.00
Relationship between precision and recall
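Average Precision
\[ \text{Average Precision} = \frac{\sum_{i=1}^{N} \text{precision}(i) \times \text{relevance}(i)}{|R|} \]
\[ \text{Average Precision} = \frac{1.00 \times 1 + 0.50 \times 0 + 0.67 \times 1 + 0.50 \times 0 + 0.60 \times 1 + 0.50 \times 0 + 0.57 \times 1}{4} = \frac{2.84}{4} = 0.71 \]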
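The precision and recall values at each rank, and the average precision built from them, can be sketched as below for the ranking D0, D6, D2, D1, D5, D4, D3. The function names are illustrative.

```python
def precision_recall_at_ranks(ranking, relevant):
    """Return one (rank, precision, recall) row per position in the ranking."""
    rows, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        rows.append((k, hits / k, hits / len(relevant)))
    return rows

def average_precision(ranking, relevant):
    """Sum precision(i) at the ranks holding a relevant document, divided by |R|."""
    total, hits = 0.0, 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:          # relevance(i) = 1
            hits += 1
            total += hits / k        # precision(i)
    return total / len(relevant)

ranking = ["D0", "D6", "D2", "D1", "D5", "D4", "D3"]
relevant = {"D0", "D2", "D5", "D3"}

for k, p, r in precision_recall_at_ranks(ranking, relevant):
    print(k, round(p, 2), round(r, 2))
print(round(average_precision(ranking, relevant), 2))  # 0.71
```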