Document Organization using Self-Organizing Feature Maps (WEBSOFM)
Apostolos Georgakis
Artificial Intelligence and Information Analysis Lab
Department of Informatics, Aristotle University of Thessaloniki

Self-Organizing Maps Algorithm

SOMs are neural networks with:
• a two-layer structure
• a feed-forward topology

They:
• form a non-linear projection from an arbitrary data manifold onto a low-dimensional discrete map
• perform competitive, unsupervised training
• map the probability density function (pdf) of the input space onto a 2D or 3D lattice

The topology of the lattice can be either hexagonal or orthogonal.

SOMs belong to the class of vector quantizers (k-means, VQ, LVQ, etc.). Let x_i(t) = [x_i1(t), x_i2(t), ..., x_im(t)]^T denote an m×1 feature vector and m_j(t) denote the weight vector of the j-th neuron.

Index of the winning neuron:

  c = arg min_j || x_i(t) − m_j(t) ||

Updating function:

  m_j(t+1) = m_j(t) + h_cj(t) [ x_i(t) − m_j(t) ],  for j ∈ N_c
  m_j(t+1) = m_j(t),                                for j ∉ N_c

where N_c is the neighbourhood centered on the winner, which is modified toward x_i(t). The neighbourhood kernel is

  h_cj(t) = a(t) exp( − || r_c − r_j ||^2 / (2 σ^2(t)) )

where a(t) is the learning rate and σ(t) denotes the diameter of the updating kernel. r_c and r_j are the locations on the lattice of the winner neuron and of the neuron being updated, respectively.

Information Retrieval of Domain-Specific Material

Information retrieval (IR) consists of:
• Organization: representation and storage of the available data
• Retrieval: exploration of the organized data

Prior to retrieval, the data repository (corpus) has to be organized according to the retrieval method to be applied. Without a retrieval-oriented organization method, retrieving even a fraction of the available documents becomes onerous.
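The winner search and neighbourhood update described above can be sketched in a few lines. This is a minimal illustration, not the WEBSOFM implementation; the function and variable names are ours, and NumPy is assumed. The Gaussian kernel h_cj leaves neurons far from the winner essentially unchanged, which plays the role of the hard neighbourhood N_c.

```python
import numpy as np

def som_step(weights, grid, x, lr, sigma):
    """One SOM update step (illustrative sketch).

    weights: (n_neurons, m) array of weight vectors m_j(t)
    grid:    (n_neurons, 2) array of lattice coordinates r_j
    x:       (m,) input feature vector x_i(t)
    lr:      learning rate a(t)
    sigma:   kernel width sigma(t)
    """
    # c = arg min_j ||x_i(t) - m_j(t)||
    c = np.argmin(np.linalg.norm(weights - x, axis=1))
    # h_cj(t) = a(t) * exp(-||r_c - r_j||^2 / (2 sigma^2))
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)
    h = lr * np.exp(-d2 / (2.0 * sigma ** 2))
    # m_j(t+1) = m_j(t) + h_cj(t) * (x_i(t) - m_j(t))
    return weights + h[:, None] * (x - weights)
```

In a full training loop one would iterate this step over all feature vectors while shrinking a(t) and σ(t) over time, as the SOM algorithm requires.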
• Scenario: a user is interested in locating documents or web pages related to a certain topic. The user queries the IR system in natural language, or supplies available sources (documents and/or web pages).

Information Retrieval

The above problem can be dealt with by any of the following:
• Boolean model — index terms
• Vector model — vectors in a t-dimensional space
• Probabilistic model — probability theory

Boolean Model

Documents are represented by index terms or keywords.
• Indexing techniques:
  - Suffix arrays (a text suffix is a string running from a text position to the end of the text)
  - Signature files (word-oriented index structures based on hashing)
  - Inverted files (a vocabulary file plus an occurrences file)
• Boolean operators are used in the retrieval.

• Drawbacks of the Boolean model:
  - It does not support significance-weighting of the query terms.
  - Users are not always familiar with the Boolean operators.
  - No sufficient mechanism exists for ranking the retrieved documents.
  - It is capable of handling only keyword-based queries.
• Alternatives: fuzzy Boolean operators.

Vector Space Model

• The words and the documents are represented by vectors.
• Advantages:
  - Supports term-weighting.
  - The retrieved documents can be sorted according to their similarity with the help of the various vector norms.
  - It is a user-friendly model.
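To make the vector space model concrete, the sketch below builds term-weighted document vectors and ranks documents by cosine similarity. It is a minimal example under stated assumptions — tokenized input and a standard tf-idf weighting with a logarithmic idf — and not necessarily the exact weighting scheme used in WEBSOFM.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors for a list of token lists (vector space model)."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(t for d in docs for t in set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        # weight = term frequency * log(inverse document frequency)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Retrieval then amounts to representing the query as a vector in the same space and sorting the documents by their cosine similarity to it — which is exactly the ranking capability the Boolean model lacks.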
Preprocessing

• Text processing:
  - HTML cleaning
  - Plain-text cleaning (e-mail addresses, URLs, word separators, numbers, etc.)
• Stemming: clustering by elimination of word suffixes
• Feature vector extraction:
  - Calculation of the contextual statistics
  - Formation of the feature vectors (vector space model)

Language Modeling (Contextual Statistics)

For each word, consider both its preceding and its following words. Word i is represented by the averaged codes of its context:

  x_i = (1/N_i) [ Σ_{l=1..N_w} n_li e_l ;  ε e_i ;  Σ_{m=1..N_w} n_im e_m ]

where n_li is the number of times word l precedes word i, n_im the number of times word m follows word i, e_l the code vector of word l, N_i the number of occurrences of word i, N_w the vocabulary size, and ε a weighting for the word's own code.

[Figure: the word-categories map for a 27×27 network.]

Example — the k-th document ("A Prague Guide - Czech Restaurants - Prag - Praha - From Andel 3W Tourist Service English.htm") after stemming:

  PRAGUE guid CZECH restaur PRAG PRAHA ANDEL tourist servic ENGLISH CZECH restaur HOSTINEC KALICHA NEBOZIZEK NOVOMESTSKY PIVOVAR POD KRIDLEM FLEKU KRKAVCU MALTEZSKYCH RYTIRU HOSTINEC KALICHA NA BOJISTI PRAGUE famou CZECH restaur PRAGUE popular . folklor musican realli get live mood drink glass beer gobbl traddit CZECH GULAS DUMPLINGS pick histori . reserv recommend . back list NEBOZIZEK NA BOJISTI PRAGUE famou CZECH restaur PRAGUE popular . ...

The word statistics are combined together and smoothing is applied.

[Figure: document map.]

[Figure: mean square error over training. The solid vertical line denotes the time instant when the simulated annealing process was used to overcome local minima in the clustering procedure.]

Dimensionality Reduction

The dimensionality of the feature vectors is exceptionally high, i.e.
for the HyperGeo corpus (393 web pages, 3084 stems) the dimension is 3084 × 3 = 9252.

Dimensionality-reduction techniques:
• PCA, ICA, LDA
• Random projection
• Component elimination

Random Projection

In random projection we compute an m × 3N_w matrix R having the following properties:
• The components in each column are chosen to be independent, identically distributed Gaussian variables with zero mean and unit variance.
• Each column is normalized to unit norm.

The projected feature vector is then

  x̂_i = R x_i,  with R ∈ R^{m×3N_w}

Component Elimination

To reduce the computational complexity of locating the winner, we order the components of the feature vectors according to the variance-like distance

  u_j = Σ_{i=1..k} ( x_ij(t) − E_j{x(t)} )^2

where

  E_j{x(t)} = (1/k) Σ_{i=1..k} x_ij(t)

and k is the number of feature vectors.

Rearrange the components in each feature vector using u_j so that the components with the strongest values appear first. The Euclidean distance is then computed as

  || x_i(t) − m_j(t) ||^2 = Σ_{n=1..d} ( x_in(t) − m_jn(t) )^2 + Σ_{n=d+1..m} ( x_in(t) − m_jn(t) )^2

By omitting the second sum we can still get an estimate of the winning neuron.

[Figure: percentage of correctly identified winners with respect to the dimensionality difference between the original space and the "truncated" sub-space.]

References

1. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval (ACM, 1999).
2. M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modelling and Text Retrieval (SIAM, 1999).
3. G. Salton and M. J. McGill, Introduction to Modern Information Retrieval (McGraw-Hill, 1983).
4. S. Kaski, K. Lagus, T. Honkela, and T. Kohonen, Statistical aspects of the WEBSOM system in organizing document collections, Computing Science and Statistics, 29, 1998, 281-290.
5. T.
Kohonen, Self-Organizing Maps (Springer-Verlag, 1997).
6. T. Kohonen, Self-organization of very large document collections: State of the art, Proc. of the 8th Int. Conf. on Artificial Neural Networks, 1, Springer, 1998, 65-74.
7. H. Ritter and T. Kohonen, Self-organizing semantic maps, Biol. Cybernetics, 61, 1989, 241-254.
8. S. Haykin, Neural Networks: A Comprehensive Foundation (Prentice-Hall, 1999).
9. T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela, Self-organization of a massive document collection, IEEE Trans. on Neural Networks, 11:3, 2000, 574-585.
10. S. Kaski, Data exploration using self-organizing maps, PhD thesis, Helsinki University of Technology, 1997.
11. C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing (MIT Press, 1999).
12. W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms (Prentice-Hall, 1992).
13. M. F. Porter, An algorithm for suffix stripping, Program, 14, 1980, 130-137.
14. C. Becchetti and L. P. Ricotti, Speech Recognition: Theory and C++ Implementation (Wiley, 1998).
15. M. P. Oakes, Statistics for Corpus Linguistics (University Press, 1998).
16. S. Kaski, Dimensionality reduction by random mapping: Fast similarity computation for clustering, Proc. of the 8th Int. Conf. on Neural Networks, IEEE, 1, 1998, 413-418.
17. S. Kaski, Fast winner search for SOM-based monitoring and retrieval of high-dimensional data, Proc. of the 9th Int. Conf. on Artificial Neural Networks, Edinburgh, U.K., September 1999.