Text mining
Gergely Kótyuk
Laboratory of Cryptography and System Security (CrySyS)
Budapest University of Technology and Economics
www.crysys.hu
Introduction
 Generic model
– Document preprocessing
– Text mining methods
Text Mining Tasks
 Classification (supervised learning)
– Binary classification
– Single-label (multi-class) classification
– Multi-label classification
– Multi-level (hierarchical) classification
 Clustering (unsupervised learning)
 Summarization
– Extraction: only parts of the original text
– Abstraction: introduces text that is not included in the original text
Solutions
 Classification
– Decision tree
– Neural network
– Bayesian network
 Clustering
– k-means
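
As a rough illustration of these solutions, the sketch below trains a decision tree classifier and runs k-means clustering on bag-of-words document vectors. The use of scikit-learn and the toy corpus are assumptions made for illustration, not part of the original slides.

# Hypothetical illustration with scikit-learn; corpus and labels are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

docs = ["cheap pills online", "meeting at noon", "buy cheap pills", "lunch at noon"]
labels = [1, 0, 1, 0]                       # binary classification: 1 = spam, 0 = not

X = CountVectorizer().fit_transform(docs)   # bag-of-words vectors (see next slides)

clf = DecisionTreeClassifier().fit(X, labels)               # supervised: decision tree
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)   # unsupervised: k-means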
Document preprocessing
 Goal: represent any text compactly, with a fixed number of parameters
 Representation: vector space model
Vector space model
 The text is tokenized into words
 The words are canonicalized to base words; we refer to base words as terms
 A dictionary is built: the set of the terms in the document
 The document is represented as a vector: the ith element of the vector is the number of times the ith term of the dictionary occurs in the document
 The collection of documents is represented in the term-document matrix (see the sketch below)
 Problem: the number of dimensions is too large
Solution: feature selection
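
A minimal sketch of the vector space model, assuming a toy two-document corpus and skipping canonicalization (plain whitespace tokenization stands in for it):

# Build the term-document matrix by hand; the corpus is illustrative.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
tokens = [d.split() for d in docs]                        # tokenize into words
dictionary = sorted({t for doc in tokens for t in doc})   # the set of terms

counts = [Counter(doc) for doc in tokens]
# Entry [i][j] = number of times the ith dictionary term occurs in document j.
term_document = [[c[term] for c in counts] for term in dictionary]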
Dimension Reduction
 Feature Selection: find a subset of original variables
– Document Frequency Thresholding (see the sketch after this list)
• Omit the words with occurrence counts greater than a threshold value, because these words are not discriminative
• Omit the words with occurrence counts less than a threshold value, because these words do not carry much information
– Information gain based feature selection (information theory)
– Chi-square based feature selection (statistics)
 Feature Extraction: transform the data into fewer dimensions
– Latent Semantic Indexing (LSI)
– Principal Component Analysis (PCA)
– Nonlinear methods
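
A hedged sketch of document frequency thresholding; scikit-learn's CountVectorizer and the particular threshold values are assumptions chosen for illustration:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran", "a dog barked"]

# max_df=0.5 omits terms occurring in more than half of the documents
# (not discriminative); min_df=2 omits terms occurring in fewer than
# two documents (they carry little information).
vectorizer = CountVectorizer(max_df=0.5, min_df=2)
X = vectorizer.fit_transform(docs)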
Latent Semantic Indexing (LSI)
 SVD is applied to the term-document matrix
 The features belonging to the k largest singular values represent the term-document matrix well; these features are used
 LSI regards documents with many common words as being semantically near
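
A minimal LSI sketch, assuming scikit-learn and k = 2; TruncatedSVD operates on the matrix with documents as rows (the transpose of the term-document matrix), which does not change the idea:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats chase dogs", "dogs chase cats",
        "stocks fell today", "markets and stocks rose"]
X = CountVectorizer().fit_transform(docs)   # documents as rows, terms as columns

lsi = TruncatedSVD(n_components=2)          # keep the k = 2 largest singular values
X_lsi = lsi.fit_transform(X)                # document coordinates in the latent space
# Documents sharing many words (the two cat/dog sentences) end up close together.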
Principal Component Analysis (PCA)
 Also called Karhunen-Loève transform (KLT)
 A linear technique
 Maps the data to a lower-dimensional space in such a way that the variance of the low-dimensional representation is maximized
 The algorithm
– The correlation matrix of the data is constructed
– The eigenvectors and eigenvalues of the correlation matrix are calculated
– The original space is reduced to the space spanned by the eigenvectors that belong to the largest eigenvalues
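
A sketch of the algorithm above in numpy; the random data matrix and k = 2 are illustrative assumptions:

import numpy as np

X = np.random.rand(100, 5)                  # 100 samples, 5 features (made up)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize, so cov(Xs) = corr(X)

C = np.cov(Xs, rowvar=False)                # correlation matrix of the data
eigvals, eigvecs = np.linalg.eigh(C)        # eigh sorts eigenvalues in ascending order

k = 2
W = eigvecs[:, -k:]                         # eigenvectors of the k largest eigenvalues
X_reduced = Xs @ W                          # data in the reduced space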
Kernel PCA
 A nonlinear method
 PCA + kernel trick
 Kernel trick (generally)
– we map observations from a general set S into a higher-dimensional space V
– we hope that the general classification problem in S reduces to a linear classification problem in V
– the trick lets us avoid explicitly computing the mapping of the observations from S to V
• We use a learning algorithm that needs only the dot product operation in V
• We use a mapping that allows us to calculate the dot products within V by a kernel function K within S (the original space)
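
A kernel PCA sketch; scikit-learn, the RBF kernel, and the two-circles dataset are assumptions chosen to show the nonlinear case:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05)  # nonlinearly structured data

# The RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) computes dot products
# in V without ever mapping the points from S to V explicitly.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)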
Manifold learning techniques
 they minimize a cost function that retains local properties of the data
 methods
– Locally Linear Embedding (LLE)
– Hessian LLE
– Laplacian Eigenmaps
– Local tangent space alignment (LTSA)
– Maximum Variance Unfolding (MVU)
Locally Linear Embedding (LLE)
[figures: illustrations of the LLE algorithm; the images are not preserved in this transcript]
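
A minimal LLE sketch, assuming scikit-learn and the classic swiss-roll dataset:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000)

# Each point is reconstructed as a linear combination of its nearest neighbors;
# the 2-D embedding is chosen to preserve these local reconstruction weights.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_lle = lle.fit_transform(X)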
Maximum Variance Unfolding (MVU)
 instead of defining a fixed kernel, it tries to learn the kernel using semidefinite programming
 exactly preserves all pairwise distances between nearest neighbors
 maximizes the distances between points that are not nearest neighbors
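
A sketch of the MVU semidefinite program, assuming cvxpy for the SDP and a tiny random dataset (SDP solvers do not scale to large n):

import numpy as np
import cvxpy as cp
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(20, 3)                            # tiny made-up dataset
n = X.shape[0]
G = kneighbors_graph(X, n_neighbors=4).toarray()     # nearest-neighbor graph

K = cp.Variable((n, n), PSD=True)                    # the learned kernel matrix
constraints = [cp.sum(K) == 0]                       # center the embedding
for i in range(n):
    for j in range(n):
        if G[i, j]:                                  # preserve neighbor distances exactly
            d2 = float(np.sum((X[i] - X[j]) ** 2))
            constraints.append(K[i, i] - 2 * K[i, j] + K[j, j] == d2)

# Maximizing trace(K) maximizes total variance: non-neighbor points are
# pushed apart while neighbor distances stay fixed ("unfolding" the manifold).
cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()

# Embedding: top eigenvectors of K scaled by the square roots of the eigenvalues.
eigvals, eigvecs = np.linalg.eigh(K.value)
Y = eigvecs[:, -2:] * np.sqrt(np.maximum(eigvals[-2:], 0))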