Download Artificial Immune Systems: An Emerging Technology

Document Maps Slawomir Wierzchon , Mieczyslaw Klopotek Michal Draminski Krzysztof Ciesielski Mariusz Kujawiak Institute of Computer Science, Polish Academy of Sciences Warsaw Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems" Agenda  Motivation  What is a document map  Map creation  Clustering  Experimental results  Future directions Motivation  The Web as well as intranets become increasingly content-rich: simple ranked lists or even hierarchies of results seem not to be adequate anymore  A good way of presenting massive document sets in an understandable form will be crucial in the near future Document map  Many attempts have been made to visualize sets of dicuments not just like a list, but rather in two dimensions  A document map is a mapping of a set of documents to 2-D representing their interrelationships Linear relationship presentation (Internet Cartographer) A relationship  A link between hypertext documents  Citation in the bibliography  Content similarity A tree of relations with central subject (Inxight – Tree Studio ) Selforganizing map (WebSOM) dissimilarity of grouops of documents Document frequency in clusters A meta search engine map Our approach – multiple representations (BEATCA) Map visualizations in 3D (BEATCA) Future research – hypergeometric representation (Fish-Eye eEffect) Processing Flow Diagram - BEATCA Search Engine DB REGISTRY Spider INTERNET Downloading HT-Base Do Spid wn er loa di n Indexing + Optimizing VEC-Base Clustering of docs DocGR -Base Mapping MAPBase Clustering of cells CellGRBase ........ ........ g ........ ........ ........ HT-Base ........  The preparation of documents is done by an indexer, which turns a document into a vector-space model representation  Indexer also identifies frequent phrases in document set for clustering and labelling purposes  Subsequently, dictionary optimization is performed - extreme entropy and extremely frequent terms excluded  The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation  ‘The best’ (wrt some similarity measure) map is used by the query processor in response to the user’s query How are the maps created  A modified WebSOM method is used: – compact reference vectors representation – broad-topic initialization method – joint winner search method – multi-level (hierarchical) maps – multi-phase document clustering: • • • • initial grouping to identify major topics Initial document grouping WEBSOM on document groups fuzzy cell clusters extraction and labelling My dog likes Document model in search this food engines  In the so-called vector model a document is considered as a vector in space spanned by the words it contains. dog food When walking, I take some food walk Document model in search engines  The relevance of a document to a query or to another document is measured as cosine of angle between the query and the document. dog food Query: walk walk Reference vector representation  Vectors are sparse by nature  During learning process they become even sparser  Represented as a balanced red-black trees  Tolerance threshold imposed  Terms (dimensions) below threshold are removed  Significant complexity reduction without negative quality impact Topic-sensitive initialization  Inter-topic similarities important both for map learning and visualization/cluster extraction  Simple approach: – Use LSI to select K main broad topics – Select K map cells (evenly spread over the map) as the fixpoints for individual topics – Initialize selected fixpoints with broad topics – Initialize remaining cells with „in-between values” Clustering document vectors r x m Mocna zmiana położenia (gruba strzałka) Document space 2D map Important difference to general clustering: not only clusters with similar documents, but also neighboring clusters similar Joint winner search  Global winner search: accurate but slow  Local winner search: faster but can be inaccurate during rapid changes  Start with single phase of global search  Document movements become more smooth during learning process: usually local search is enough  Use global search when occassional sudden moves occur (eg. outliers, neighbourhood width decrease) Hierarchical maps  Bottom-up approach  Feasible (with joint winner search method)  Start with most detailed map  Compute weighted centroids of map areas  Use them as seeds for coarser map  Top-down approach is possible but requires fixpoints Clustering document groups  Numerous methods exists but none of them directly applicable: – Extremely fuzzy structure of topical groups in SOM cells – Neccesity of taking into account similiarity measures both in original document space and in the map space – Outlier-handling problem during cluster formation – No a priori estimation of the number of topical groups  Fuzzy C-MEANS on lattice of map cells applied  Graph theoretical approach (density- and distance- based MST) combined with fuzzy clustering  Clustered documents are labeled by weighted centroids of cell reference vectors scaled with between-group entropy Experiments with map convergence  We examined the convergence of the maps to a stable state depending on: – type of alpha function (search radius reduction) – type of winner search method – type of initialization method Convergence – alpha functions (linear versus reciprocal) Convergence – winner search (joint versus local) Experiments with execution time  The impact of the following factors on the speed of map creation was investigated: – Map size (total number of cells) – Optimization methods: • dictionary optimization • reference vector representation  Map quality assessment: – Compare with ‘ideal’ map (e.g. without optimizations) – Identical initialization and learning parameters – Compute sum of squared distances of location of each document on both maps Execution time - map size Execution time - optimizations Future research  Maps for joint term-citation model, taking into account between-group link flow direction  Fully distributed map creation  Adaptive document retrieval and clustering: – Bayesian network based relevance measure – Survival models for document update rate estimation – Dead link propagation methods for page freshness estimation  We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects Future research  Bayesian networks will be applied in particular to: – measure relevance and classify documents – accelerate document clustering processes – construct a thesaurus supporting query enrichment – keyword extraction – between-topic dependencies estimation Thank you! Any questions?

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Artificial Immune Systems: An Emerging Technology