Applying Evolutionary Computation
Techniques to Web Information Retrieval
Chih-Chin Lai, Ph.D.
Dept. of Computer Science and Information Engineering
National University of Tainan, Taiwan
E-mail: [email protected]
Nov. 28, 2007
Outline
• Information Retrieval
– some related topics
• Evolutionary Computation (EC)
• Applying EC to Web Information Retrieval
• Conclusions
2
Introduction
• Definition of Information Retrieval
– Salton (1989): Information-retrieval systems process files of
records and requests for information, and identify and retrieve
from the files certain records in response to the information
requests. The retrieval of particular records depends on the
similarity between the records and the queries, which in turn is
measured by comparing the values of certain attributes to records
and information requests.
– Kowalski (1997): An Information Retrieval System is a system
that is capable of storage, retrieval, and maintenance of
information. Information in this context can be composed of text
(including numeric and date data), images, audio, video, and other
multi-media objects.
3
Introduction (cont.)
• Information Retrieval (IR)
– The indexing and retrieval of textual documents
• Searching newspaper articles
• Searching on the Web
– Concerned firstly with retrieving relevant
documents to a query
– Concerned secondly with retrieving large sets
of documents efficiently
4
Typical IR Task
• Given
– User has information need
– A corpus of textual natural-language documents
– A user query in the form of a textual string
• Find
– A ranked set of documents that are relevant to
the query
5
Key Qualities
• Document and query representations
• Mechanisms for finding relevant
documents and ranking the results
• Mechanisms for obtaining user feedback
6
Typical IR System
[Figure: block diagram of a typical IR system. Labels: User, Process (twice), Store, Retrieval Part, IR System, Retrieved relevant(?) documents]
7
Relevance
• Relevance is a subjective judgment
– Being on the proper subject
– Being timely (recent information)
– Satisfying the goals of the user and his/her
intended use of the information (information
need)
8
IR System Components
• Text operations form index words (tokens)
– Stopword removal
– Stemming
• Indexing maps each keyword to the set of
documents that contain the keyword
• Searching retrieves documents that contain a
given query token from the inverted index
• Ranking scores all retrieved documents
according to a relevance metric
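As a rough illustration of the components listed above, the following Python sketch strings together text operations, indexing, searching, and a simple ranking. The stopword list, the suffix-stripping `crude_stem` helper, the toy corpus, and the ranking by number of matching query terms are all assumptions made for this example; a real system would use a proper stemmer and a better relevance metric.

```python
import re
from collections import defaultdict

# Hypothetical stopword list and toy corpus, chosen only for illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "on", "in", "is", "to"}
docs = {
    1: "Searching newspaper articles",
    2: "Searching on the Web",
    3: "Indexing the Web documents",
}

def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def text_operations(text):
    """Tokenize, drop stopwords, and stem to obtain index terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def build_index(documents):
    """Inverted index: map each index term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text_operations(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Retrieve documents containing any query term, ranked by matching-term count."""
    scores = defaultdict(int)
    for term in set(text_operations(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(docs)
ranked = search(index, "web search")
print(ranked)
```

Here the query passes through the same text operations as the documents, so "Searching" in a document and "search" in the query map to the same index term.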
9
IR System Components (cont.)
• User interface manages interaction with the
user
– Query input and document output
– Relevance feedback
– Visualization of results
• Query operations transform the query to
improve retrieval
10
Examples of IR System
• Conventional (library catalog): Search by keyword, title, author, etc.
11
Examples of IR System (cont.)
• Text-based (Google): Search by keywords. Limited search using
queries in natural language
12
Examples of IR System (cont.)
• Multimedia (WebSeek): Search by visual appearance (shapes, colors,…)
13
Examples of IR System (cont.)
• Question answering systems (AnswerBus): Search in (restricted) natural
language
14
Searching the Web
• Application of IR to HTML documents on
the World Wide Web
• Three forms
– Use search engines that index a portion of the
Web documents as a full-text database
– Use Web directories, which classify selected
Web documents by subject
– Search the Web exploiting its hyperlink
structure
15
Web Search System
[Figure: block diagram of a Web search system. Labels: World Wide Web, Spider, Documents, User, Process (twice), Store, Retrieval Part, IR System, Retrieved relevant(?) documents]
16
Retrieval Models
• A retrieval model specifies the details of:
– Document and Query representation
– Matching strategies for assessing the relevance
of documents to a user query
– Methods for ranking query output
– Mechanisms for acquiring user-relevance
feedback
• Notion of relevance can be binary or
continuous (i.e. ranked retrieval)
17
Types of IR Models
• Boolean model
– Simple Boolean queries regarding existence of
terms within documents
– Easy to understand, but difficult to rank output
• Vector space model
– Documents are represented by n-dimensional
vectors
– Typically one dimension per term
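A minimal sketch of the Boolean model, assuming a hypothetical three-document corpus in which each document is reduced to the set of terms it contains. AND, OR, and NOT map directly onto set intersection, union, and difference.

```python
# Toy corpus (hypothetical): each document is the set of terms it contains.
docs = {
    "d1": {"genetic", "algorithm", "web"},
    "d2": {"web", "search", "engine"},
    "d3": {"genetic", "programming"},
}

def containing(term):
    """Ids of all documents in which the term exists."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

and_result = containing("genetic") & containing("web")     # genetic AND web
or_result = containing("genetic") | containing("search")   # genetic OR search
not_result = containing("web") - containing("genetic")     # web AND NOT genetic

print(and_result, or_result, not_result)
```

Each result is an unordered set: a document either satisfies the query or it does not, which is why ranking the output is difficult in this model.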
18
Types of IR Models (cont.)
• Probabilistic model
– Start with some user-supplied relevance
information about a “training set” of documents
– The training set is used to compute term weights
by estimating
P(t in document | document is relevant)
P(t in document | document is irrelevant )
– Useful for theoretical analysis, but probably not
in practice (?)
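The two probabilities can be estimated by counting over the training set. The sketch below assumes a hypothetical four-document training set, add-0.5 smoothing to avoid zero probabilities, and a log odds ratio to combine the two estimates into one term weight; the slide does not specify any of these choices.

```python
import math

# Hypothetical training set: (terms in the document, judged relevant by the user?)
training = [
    ({"genetic", "search"}, True),
    ({"genetic", "web"}, True),
    ({"web", "news"}, False),
    ({"news", "sports"}, False),
]

def estimate(term, relevant):
    """Estimate P(term in document | relevance class) with add-0.5 smoothing."""
    group = [terms for terms, rel in training if rel == relevant]
    hits = sum(term in terms for terms in group)
    return (hits + 0.5) / (len(group) + 1.0)

def term_weight(term):
    """Log odds ratio of the two estimates (an assumed way to combine them)."""
    p = estimate(term, True)    # P(t in document | document is relevant)
    q = estimate(term, False)   # P(t in document | document is irrelevant)
    return math.log(p * (1 - q) / (q * (1 - p)))

for t in ("genetic", "web", "news"):
    print(t, round(term_weight(t), 3))
```

A term seen mostly in relevant documents gets a positive weight, a term seen equally in both classes gets a weight of zero, and a term seen mostly in irrelevant documents gets a negative weight.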
19
Statistical Retrieval
• Retrieval based on similarity between query
and documents
• Output documents are ranked according to
similarity to query
• Similarity based on occurrence frequencies
of keywords in query and document
20
The Vector Space Model
• A document is typically represented by a
bag of words (unordered words with
frequencies)
• Assume a vocabulary of t distinct terms
• Each term, i, in a document or query, j, is
given a real-valued weight, wij
• Both documents and queries are expressed
as t-dimensional vectors
dj = (w1j, w2j, …, wtj)
21
Concept Representation
Example:
Vdoc1 = 2T1 + 4T2 + 5T3
Vdoc2 = 4T1 + 7T2 + T3
Vquery = 0T1 + 0T2 + 2T3
[Figure: the three vectors plotted in the space spanned by the term axes T1, T2, and T3]
• Is Vdoc1 or Vdoc2 more similar to Vquery?
• How to measure the degree of similarity?
22
Term Weights: TF-IDF
• More frequent terms in a document are more indicative
of the topic
fij = frequency of term i in document j
tfij = fij / max_i{fij} (normalized by the frequency of the most frequent term in document j)
• Terms that appear in many different documents are less
indicative of the overall topic
dfi = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i
= log(N / dfi) (where N is the total number of documents)
23
TF-IDF Weighting
• A typical combined term importance indicator
is tf-idf weighting
wij = tfij · idfi = tfij · log(N / dfi)
• A term occurring frequently in the document
but rarely in the rest of the collection is given
high weight
• Experimentally, tf-idf has been found to work
well
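The weighting can be sketched in a few lines of Python. The three-document corpus is hypothetical, and the natural logarithm is an assumption, since the slides do not state the base of the log.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = {
    "d1": "web search web retrieval".split(),
    "d2": "genetic algorithm search".split(),
    "d3": "genetic programming".split(),
}
N = len(docs)

# df[i]: number of documents containing term i
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))

def tf_idf(tokens):
    """w_ij = tf_ij * idf_i, with tf normalized by the document's most frequent term."""
    counts = Counter(tokens)
    max_f = max(counts.values())
    return {term: (f / max_f) * math.log(N / df[term]) for term, f in counts.items()}

weights = {doc_id: tf_idf(tokens) for doc_id, tokens in docs.items()}
for term, w in sorted(weights["d1"].items()):
    print(term, round(w, 3))
```

In document d1 the term "web" gets the highest weight because it is frequent there and absent elsewhere, while "search" is discounted because it also appears in d2.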
24
Similarity Measure
• A similarity measure is a function that
computes the degree of similarity between
two vectors
• Using a similarity measure between the
query and each document
– to rank the retrieved documents
– to control the size of the retrieved set
25
Similarity Measure (cont.)
• Cosine similarity measures the cosine
of the angle between two vectors: the inner
product normalized by the vector lengths

$$\mathrm{CosSim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{iq}^{2}}}$$

[Figure: Vdoc1, Vdoc2, and Vquery drawn in the space spanned by the term axes t1, t2, and t3]
Vdoc1 = 2T1 + 4T2 + 5T3 CosSim(Vdoc1, Vquery) = 10 / sqrt((4+16+25)(0+0+4)) = 0.75
Vdoc2 = 4T1 + 7T2 + 1T3 CosSim(Vdoc2, Vquery) = 2 / sqrt((16+49+1)(0+0+4)) = 0.12
Vquery = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity but only 5 times better using
inner product.
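The worked example can be checked with a short Python sketch; the function names are chosen here for illustration.

```python
import math

def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cos_sim(u, v):
    """Inner product normalized by the two vector lengths."""
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))

v_doc1 = (2, 4, 5)
v_doc2 = (4, 7, 1)
v_query = (0, 0, 2)

sim1 = cos_sim(v_doc1, v_query)
sim2 = cos_sim(v_doc2, v_query)
print(round(sim1, 2), round(sim2, 2))                      # cosine scores
print(inner(v_doc1, v_query), inner(v_doc2, v_query))      # inner products
```

The cosine scores round to 0.75 and 0.12 (a ratio of about 6), while the raw inner products are 10 and 2 (a ratio of 5). Normalizing by length penalizes Vdoc2, which is a long vector pointing mostly away from the query.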
26
Accuracy Measures: Precision and Recall
[Figure: the document collection divided into four regions]
                 relevant                      irrelevant
retrieved        retrieved & relevant          retrieved & irrelevant
not retrieved    not retrieved but relevant    not retrieved & irrelevant

From all the documents that are retrieved by the IR system,
how many are relevant?
precision = (Number of relevant documents retrieved) / (Total number of retrieved documents)
From all the documents that are relevant out there,
how many did the IR system retrieve?
recall = (Number of relevant documents retrieved) / (Total number of relevant documents)
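Both measures follow directly from the sets of retrieved and relevant documents. A minimal sketch, using hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query; 0.0 when a denominator is empty."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 5 documents actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Two of the four retrieved documents are relevant, so precision is 0.5; two of the five relevant documents were found, so recall is 0.4.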
27
Precision and Recall
• Precision
– The ability to retrieve top-ranked documents
that are mostly relevant
• Recall
– The ability of the search to find all of the
relevant items in the corpus
28
Precision and Recall Variations
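[Figure: precision (vertical axis, 0 to 1) plotted against recall (horizontal axis, 0 to 1)]
• The ideal case: both precision and recall at 1
• Narrow query formulation: returns relevant documents but
misses many useful ones
• Broad query formulation: returns most relevant
documents but includes lots of junk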
Figure taken from:
Raymond J. Mooney (http://www.cs.utexas.edu/users/mooney/ir-course/)
29
Evolutionary Computation
• Definition
– EC techniques (genetic algorithms, GA; genetic programming, GP;
evolution strategies, ES) solve computational
problems by simulating evolution by
natural selection
– They are stochastic search algorithms
that incrementally preserve and
combine desirable features of
individual potential solutions in a
population over an extended period of
time
Figure taken from:
www.genetic-programming.org
30
Template of EC
procedure EC
begin
  t := 0;
  initializePopulation(P(0));
  evaluate(P(0));
  repeat
    t := t + 1;
    P' := selectForVariation(P(t-1));
    recombine(P');
    mutate(P');
    evaluate(P');
    P(t) := P';
  until termination = true;
end
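The template can be made concrete as follows. This is a minimal sketch, not the method of any system cited in the talk: it assumes a bit-string genetic algorithm on the toy OneMax objective (count the 1-bits), binary tournament selection, one-point crossover, bit-flip mutation, and termination after a fixed number of generations. All constants and function names are chosen here for illustration.

```python
import random

GENOME_LEN = 20                   # hypothetical problem size
POP_SIZE = 30
GENERATIONS = 40
MUTATION_RATE = 1.0 / GENOME_LEN

def fitness(ind):
    """OneMax: number of 1-bits (a toy objective, not an IR fitness)."""
    return sum(ind)

def select_for_variation(pop, rng):
    """Binary tournament: copy the fitter of two random individuals, POP_SIZE times."""
    return [max(rng.sample(pop, 2), key=fitness)[:] for _ in pop]

def recombine(pool, rng):
    """One-point crossover on consecutive pairs, in place."""
    for i in range(0, len(pool) - 1, 2):
        cut = rng.randrange(1, GENOME_LEN)
        pool[i][cut:], pool[i + 1][cut:] = pool[i + 1][cut:], pool[i][cut:]

def mutate(pool, rng):
    """Flip each bit independently with probability MUTATION_RATE."""
    for ind in pool:
        for g in range(GENOME_LEN):
            if rng.random() < MUTATION_RATE:
                ind[g] ^= 1

def evolve(seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
    best = max(pop, key=fitness)[:]           # best individual seen so far
    initial_best = fitness(best)
    for _ in range(GENERATIONS):              # termination: fixed generation budget
        offspring = select_for_variation(pop, rng)
        recombine(offspring, rng)
        mutate(offspring, rng)
        champion = max(offspring, key=fitness)   # evaluate the new population
        if fitness(champion) > fitness(best):
            best = champion[:]
        pop = offspring                       # generational replacement
    return initial_best, best

initial_best, best = evolve(seed=0)
print(initial_best, fitness(best))
```

Tracking the best individual separately means the reported result never gets worse, even though the population itself is fully replaced each generation.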
31
Applications of EC to IR
• EC has been applied to the following problems
– Automatic document indexing
– Document and term clustering
– Query definition
– Matching function learning
– Image retrieval
– Design of user profiles for IR on the Internet
– Web page classification
– Design of agents for Internet searching
32
MGA for Web Search
• Genetic algorithm
– John Holland, 1975
– David E. Goldberg, 1989
• Metagenetic algorithm (MGA)
– Proposed by Zacharis and Panayiotopoulos (2001)
– A two-stage GA that controls and optimizes
both populations simultaneously
33
MGA for Web Search (cont.)
• Zacharis and Panayiotopoulos, [IEEE Internet Computing, 2001]
34
Hierarchical Genetic Algorithm
• HGA
– Proposed by Tang et al. (1998)
– It is a variant of the conventional genetic
algorithm with a hierarchical genetic structure
– In HGA, the chromosome consists of two
types of genes
• the control genes and
• the parametric genes
• The relationship between parametric genes and
control genes is that the activation of the former is
governed by the value of the latter
35
HGA Representation
(a) Single-level chromosome, written as control genes :: parametric genes
[1 0 1 1 0 :: 53.2 19.6 34.7 68.2 75.3]
This chromosome represents the parameters (53.2, 34.7, 68.2)
(b) Two-level chromosome, written as 1st-level control genes :: 2nd-level control genes :: parametric genes
[1 0 :: 0 1 0 0 1 1 :: 33.2 78.5 46.8 22.1 94.6 55.4]
This chromosome represents the single parameter 78.5
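The decoding of both example chromosomes can be sketched in Python. The function names are hypothetical, and the two-level decoder assumes that each first-level control gene governs an equal-sized, contiguous block of second-level control genes, which is consistent with the example but not stated on the slide.

```python
def decode_one_level(control, params):
    """Keep the parametric genes whose control gene is 1."""
    return [p for c, p in zip(control, params) if c == 1]

def decode_two_level(level1, level2, params):
    """A parameter is active only if its 2nd-level gene AND the governing
    1st-level gene are both 1 (equal-sized blocks are an assumption)."""
    block = len(level2) // len(level1)
    active = []
    for i, (c2, p) in enumerate(zip(level2, params)):
        if level1[i // block] == 1 and c2 == 1:
            active.append(p)
    return active

# Chromosome (a): [1 0 1 1 0 :: 53.2 19.6 34.7 68.2 75.3]
a = decode_one_level([1, 0, 1, 1, 0], [53.2, 19.6, 34.7, 68.2, 75.3])

# Chromosome (b): [1 0 :: 0 1 0 0 1 1 :: 33.2 78.5 46.8 22.1 94.6 55.4]
b = decode_two_level([1, 0], [0, 1, 0, 0, 1, 1],
                     [33.2, 78.5, 46.8, 22.1, 94.6, 55.4])
print(a, b)
```

In (b) the second first-level gene is 0, so the last three second-level genes are switched off regardless of their own values, leaving 78.5 as the only active parameter.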
36
HGA for Web Search
[Figure: chromosome encoding for Web search. Labels: Chromosome, Dictionary, Randomly generated, Keyword1, Keyword2, "W1 > W2: select Keyword1"]
Control genes | Parametric genes
1 1 0 0 1 | news | intelligence | mit | lab | artificial |
0 1 1 0 1 | mit | artificial | ai | lab | intelligence |
37
HGA for Web Search (cont.)
Control genes (before crossover at the cut point)
1 1 0 0 1
0 1 1 0 1
Control genes (after crossover)
1 1 1 0 1
0 1 0 0 1
Parametric genes
| news | intelligence | mit | lab | artificial |
| mit | artificial | ai | lab | intelligence |
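The exchange of control genes at a cut point can be sketched as a one-point crossover. The parent strings 1 1 0 0 1 and 0 1 1 0 1 and a cut after the second gene are assumptions read off the slide (its layout does not survive extraction cleanly); with them the sketch reproduces the control-gene strings 1 1 1 0 1 and 0 1 0 0 1 shown on the slide.

```python
def one_point_crossover(parent_a, parent_b, cut):
    """Swap the tails of two equal-length gene lists after position `cut`."""
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

# Assumed parents and cut point (see the note above).
parent_a = [1, 1, 0, 0, 1]
parent_b = [0, 1, 1, 0, 1]
child_a, child_b = one_point_crossover(parent_a, parent_b, cut=2)

# The control genes then switch individual keywords on or off.
keywords_a = ["news", "intelligence", "mit", "lab", "artificial"]
active_a = [k for c, k in zip(child_a, keywords_a) if c == 1]
print(child_a, child_b, active_a)
```

Only the control genes are exchanged here; the keyword lists (parametric genes) stay with their original chromosome, as on the slide.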
38
HGA for Web Search (cont.)
Control genes | Parametric genes
1 1 1 0 1 | news | intelligence | mit | lab | artificial |
0 1 0 0 1 | mit | artificial | ai | lab | intelligence |
[Figure label: Dictionary]
39
HGA for Web Search (cont.)
[Figure: architecture of the search system. Labels: User interface, Query, Keywords, PWIS, Recommendation component, Vector DD, Vector DR, Update, Interesting, Relevance page, World Wide Web, Results by PageRank]
40
HGA for Web Search (cont.)
[Figure: fitness (vertical axis, 0 to 1) versus number of chromosomes (horizontal axis, 1 to 1519) for six methods: WRA-Keyword, HGA-Keyword, WRA-Non-Keyword, HGA-Non-Keyword, MGA, and GA]
41
HGA for Web Search (cont.)
Methods                          Fitness  PR  Stability  Time  Score  Rank
HGA-Keyword / WRA-UserKeyword    2        1   1          1     5      1
HGA-Non-Keyword                  4        2   3          1     10     2
MGA                              1        4   2          4     11     3
GA                               3        3   4          3     13     4
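The Score column appears to be the sum of the four per-criterion ranks, with the overall Rank given by ascending Score. This is an inference from the numbers rather than something stated on the slide; the sketch below simply checks that the arithmetic is consistent.

```python
# Per-criterion ranks as listed in the table (1 = best).
ranks = {
    "HGA-Keyword":     {"Fitness": 2, "PR": 1, "Stability": 1, "Time": 1},
    "HGA-Non-Keyword": {"Fitness": 4, "PR": 2, "Stability": 3, "Time": 1},
    "MGA":             {"Fitness": 1, "PR": 4, "Stability": 2, "Time": 4},
    "GA":              {"Fitness": 3, "PR": 3, "Stability": 4, "Time": 3},
}

# Assumed scoring rule: sum the ranks, lower total is better.
scores = {method: sum(r.values()) for method, r in ranks.items()}
ordering = sorted(scores, key=scores.get)
print(scores, ordering)
```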
42
Profile for Web Search
43
Conclusions
• The aim of a Web IR system is to estimate the relevance
of web information items to a user information need
expressed in a query
– This is a very hard and complex task
– It is pervaded with subjectivity, vagueness and
imprecision
• The main characteristic of EC is that it is tolerant of
imprecision, vagueness, partial truth, and approximation
– EC techniques have been used satisfactorily to improve
the IR process
49
Conclusions (cont.)
Figure taken from: M. Henzinger, “The past, present, and future of Web Information Retrieval”
50
Web Intelligence
• Today's search engines (SE) are designed for human consumption:
(1) A user queries the SE and gets relevant pages
(2) The user reads the pages and manually extracts the information
(3) The information must be integrated to produce the desired
knowledge
51
Figure taken from: Prof. F. Ciravegna, University of Sheffield, “Web Intelligence”
Web Intelligence (cont.)
• The future web will have semantics associated with pages,
and SEs will be able to provide semantically based services
Figure taken from: Prof. F. Ciravegna, University of Sheffield, “Web Intelligence”
52
References: Journals
• Information Processing and Management
• Journal of the American Society for
Information Science
• Transactions On Information Science
• Information Retrieval
• Journal of Documentation
53
Good books
• Van Rijsbergen
– “Information Retrieval”, ir.dcs.gla.ac.uk
• Sparck Jones and Willett
– “Readings in Information Retrieval”
• Baeza-Yates and Ribeiro-Neto
– “Modern Information Retrieval”
• Witten, Moffat and Bell
– “Managing Gigabytes”
54