Download Artificial Immune Systems: An Emerging Technology

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Document Maps
Slawomir Wierzchon , Mieczyslaw Klopotek
Michal Draminski Krzysztof Ciesielski
Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences
Warsaw
Research partially supported by the KBN research project 4 T11C
026 25 "Maps and intelligent navigation in WWW using
Bayesian networks and artificial immune systems"
Agenda
 Motivation
 What is a document map
 Map creation
 Clustering
 Experimental results
 Future directions
Motivation
 The Web as well as intranets become
increasingly content-rich: simple ranked
lists or even hierarchies of results seem
not to be adequate anymore
 A good way of presenting massive
document sets in an understandable
form will be crucial in the near future
Document map
 Many attempts have been made to visualize
sets of dicuments not just like a list, but rather
in two dimensions
 A document map is a mapping of a set of
documents to 2-D representing their interrelationships
Linear relationship presentation
(Internet Cartographer)
A relationship
 A link between hypertext documents
 Citation in the bibliography
 Content similarity
A tree of relations with central
subject (Inxight – Tree Studio )
Selforganizing map (WebSOM)
dissimilarity of grouops of
documents
Document frequency in clusters
A meta search engine map
Our approach – multiple
representations (BEATCA)
Map visualizations in 3D
(BEATCA)
Future research –
hypergeometric representation
(Fish-Eye eEffect)
Processing Flow Diagram - BEATCA
Search Engine
DB
REGISTRY
Spider
INTERNET Downloading HT-Base
Do Spid
wn er
loa
di n
Indexing +
Optimizing VEC-Base
Clustering
of docs
DocGR
-Base
Mapping
MAPBase
Clustering
of cells
CellGRBase
........
........
g
........
........
........
HT-Base
........
 The preparation of documents is done by an indexer, which
turns a document into a vector-space model representation
 Indexer also identifies frequent phrases in document set for
clustering and labelling purposes
 Subsequently, dictionary optimization is performed - extreme
entropy and extremely frequent terms excluded
 The map creator is applied, turning the vector-space
representation into a form appropriate for on-the-fly map
generation
 ‘The best’ (wrt some similarity measure) map is used by the
query processor in response to the user’s query
How are the maps created
 A modified WebSOM method is used:
– compact reference vectors representation
– broad-topic initialization method
– joint winner search method
– multi-level (hierarchical) maps
– multi-phase document clustering:
•
•
•
•
initial grouping to identify major topics
Initial document grouping
WEBSOM on document groups
fuzzy cell clusters extraction and labelling
My dog likes
Document model in search
this food
engines
 In the so-called
vector model a
document is
considered as a
vector in space
spanned by the
words it contains.
dog
food
When walking, I
take some food
walk
Document model in search
engines
 The relevance of
a document to a
query or to
another document
is measured as
cosine of angle
between the query
and the
document.
dog
food
Query: walk
walk
Reference vector representation
 Vectors are sparse by nature
 During learning process they become even
sparser
 Represented as a balanced red-black trees
 Tolerance threshold imposed
 Terms (dimensions) below threshold are removed
 Significant complexity reduction without
negative quality impact
Topic-sensitive initialization
 Inter-topic similarities important both for map
learning and visualization/cluster extraction
 Simple approach:
– Use LSI to select K main broad topics
– Select K map cells (evenly spread over the map) as the
fixpoints for individual topics
– Initialize selected fixpoints with broad topics
– Initialize remaining cells with „in-between values”
Clustering document vectors
r
x
m
Mocna zmiana
położenia (gruba
strzałka)
Document space
2D map
Important difference to general clustering: not only clusters
with similar documents, but also neighboring clusters similar
Joint winner search
 Global winner search: accurate but slow
 Local winner search: faster but can be inaccurate
during rapid changes
 Start with single phase of global search
 Document movements become more smooth
during learning process: usually local search is
enough
 Use global search when occassional sudden
moves occur (eg. outliers, neighbourhood width
decrease)
Hierarchical maps
 Bottom-up
approach
 Feasible (with joint
winner search
method)
 Start with most
detailed map
 Compute weighted
centroids of map
areas
 Use them as seeds
for coarser map
 Top-down approach
is possible but
requires fixpoints
Clustering document groups
 Numerous methods exists but none of them
directly applicable:
– Extremely fuzzy structure of topical groups in SOM
cells
– Neccesity of taking into account similiarity measures
both in original document space and in the map space
– Outlier-handling problem during cluster formation
– No a priori estimation of the number of topical groups
 Fuzzy C-MEANS on lattice of map cells applied
 Graph theoretical approach (density- and
distance- based MST) combined with fuzzy
clustering
 Clustered documents are labeled by weighted
centroids of cell reference vectors scaled with
between-group entropy
Experiments with map
convergence
 We examined the convergence of the maps
to a stable state depending on:
– type of alpha function (search radius
reduction)
– type of winner search method
– type of initialization method
Convergence – alpha functions
(linear versus reciprocal)
Convergence – winner search
(joint versus local)
Experiments with execution
time
 The impact of the following factors on the speed
of map creation was investigated:
– Map size (total number of cells)
– Optimization methods:
• dictionary optimization
• reference vector representation
 Map quality assessment:
– Compare with ‘ideal’ map (e.g. without optimizations)
– Identical initialization and learning parameters
– Compute sum of squared distances of location of each
document on both maps
Execution time - map size
Execution time - optimizations
Future research
 Maps for joint term-citation model, taking into
account between-group link flow direction
 Fully distributed map creation
 Adaptive document retrieval and clustering:
– Bayesian network based relevance measure
– Survival models for document update rate estimation
– Dead link propagation methods for page freshness
estimation
 We also intend to integrate Bayesian and immune
system methodologies with WebSOM in order to
achieve new clustering effects
Future research
 Bayesian networks will be applied in
particular to:
– measure relevance and classify documents
– accelerate document clustering processes
– construct a thesaurus supporting query
enrichment
– keyword extraction
– between-topic dependencies estimation
Thank you!
Any
questions?