Text mining
Gergely Kótyuk
Laboratory of Cryptography and System Security (CrySyS)
Budapest University of Technology and Economics
www.crysys.hu
Introduction
 Generic model
– Document preprocessing
– Text mining methods
Text Mining Tasks
 Classification (supervised learning)
– Binary classification
– Single-label (multi-class) classification
– Multi-label classification
– Multi-level (hierarchical) classification
 Clustering (unsupervised learning)
 Summarization
– Extraction: only parts of the original text
– Abstraction: introduces text that is not included in the original text
Solutions
 Classification
– Decision tree
– Neural network
– Bayesian network
 Clustering
– k-means
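
As a rough illustration of these solutions, the sketch below trains a decision tree classifier and runs k-means clustering on bag-of-words document vectors. The use of scikit-learn and the toy corpus are assumptions made for illustration, not part of the original slides.

# Hypothetical illustration with scikit-learn; corpus and labels are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

docs = ["cheap pills online", "meeting at noon", "buy cheap pills", "lunch at noon"]
labels = [1, 0, 1, 0]                       # binary classification: 1 = spam, 0 = not

X = CountVectorizer().fit_transform(docs)   # bag-of-words vectors (see next slides)

clf = DecisionTreeClassifier().fit(X, labels)               # supervised: decision tree
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)   # unsupervised: k-means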
Document preprocessing
 Goal: represent any text compactly, with a fixed number of parameters
 Representation: vector space model
Vector space model
 The text is tokenized into words
 The words are canonicalized to base words; we refer to base words as terms
 A dictionary is built: the set of the terms in the document
 The document is represented as a vector: the ith element of the vector is the number of times the ith term of the dictionary occurs in the document
 The collection of documents is represented in the term-document matrix (see the sketch below)
 Problem: the number of dimensions is too large
Solution: feature selection
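
A minimal sketch of the vector space model, assuming a toy two-document corpus and skipping canonicalization (plain whitespace tokenization stands in for it):

# Build the term-document matrix by hand; the corpus is illustrative.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
tokens = [d.split() for d in docs]                        # tokenize into words
dictionary = sorted({t for doc in tokens for t in doc})   # the set of terms

counts = [Counter(doc) for doc in tokens]
# Entry [i][j] = number of times the ith dictionary term occurs in document j.
term_document = [[c[term] for c in counts] for term in dictionary]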
Dimension Reduction
 Feature Selection: find a subset of original variables
– Document Frequency Thresholding (see the sketch after this list)
• Omit the words with occurrence counts greater than a threshold value, because these words are not discriminative
• Omit the words with occurrence counts less than a threshold value, because these words do not carry much information
– Information gain based feature selection (information theory)
– Chi-square based feature selection (statistics)
 Feature Extraction: transform the data into fewer dimensions
– Latent Semantic Indexing (LSI)
– Principal Component Analysis (PCA)
– Nonlinear methods
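
A hedged sketch of document frequency thresholding; scikit-learn's CountVectorizer and the particular threshold values are assumptions chosen for illustration:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran", "a dog barked"]

# max_df=0.5 omits terms occurring in more than half of the documents
# (not discriminative); min_df=2 omits terms occurring in fewer than
# two documents (they carry little information).
vectorizer = CountVectorizer(max_df=0.5, min_df=2)
X = vectorizer.fit_transform(docs)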
Latent Semantic Indexing (LSI)
 SVD is applied to the term-document matrix
 The features belonging to the k largest singular values represent the term-document matrix well; these features are used
 LSI regards documents with many common words as being semantically near
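
A minimal LSI sketch, assuming scikit-learn and k = 2; TruncatedSVD operates on the matrix with documents as rows (the transpose of the term-document matrix), which does not change the idea:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats chase dogs", "dogs chase cats",
        "stocks fell today", "markets and stocks rose"]
X = CountVectorizer().fit_transform(docs)   # documents as rows, terms as columns

lsi = TruncatedSVD(n_components=2)          # keep the k = 2 largest singular values
X_lsi = lsi.fit_transform(X)                # document coordinates in the latent space
# Documents sharing many words (the two cat/dog sentences) end up close together.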
Principal Component Analysis (PCA)
 Also called Karhunen-Loève transform (KLT)
 A linear technique
 Maps the data to a lower-dimensional space in such a way that the variance of the low-dimensional representation is maximized
 The algorithm
– The correlation matrix of the data is constructed
– The eigenvectors and eigenvalues of the correlation matrix are calculated
– The original space is reduced to the space spanned by the eigenvectors that belong to the largest eigenvalues
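
A sketch of the algorithm above in numpy; the random data matrix and k = 2 are illustrative assumptions:

import numpy as np

X = np.random.rand(100, 5)                  # 100 samples, 5 features (made up)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize, so cov(Xs) = corr(X)

C = np.cov(Xs, rowvar=False)                # correlation matrix of the data
eigvals, eigvecs = np.linalg.eigh(C)        # eigh sorts eigenvalues in ascending order

k = 2
W = eigvecs[:, -k:]                         # eigenvectors of the k largest eigenvalues
X_reduced = Xs @ W                          # data in the reduced space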
Kernel PCA
 A nonlinear method
 PCA + kernel trick
 Kernel trick (generally)
– we map observations from a general set S into a higher-dimensional space V
– we hope that the general classification problem in S reduces to a linear classification problem in V
– the trick lets us avoid explicitly computing the mapping of the observations from S to V
• We use a learning algorithm that needs only the dot product operation in V
• We use a mapping that allows us to calculate the dot products within V by a kernel function K within S (the original space)
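
A kernel PCA sketch; scikit-learn, the RBF kernel, and the two-circles dataset are assumptions chosen to show the nonlinear case:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05)  # nonlinearly structured data

# The RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) computes dot products
# in V without ever mapping the points from S to V explicitly.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)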
Manifold learning techniques
 they minimize a cost function that retains local properties of the data
 methods
– Locally Linear Embedding (LLE)
– Hessian LLE
– Laplacian Eigenmaps
– Local tangent space alignment (LTSA)
– Maximum Variance Unfolding (MVU)
Locally Linear Embedding (LLE)
[figures: illustrations of the LLE algorithm; the images are not preserved in this transcript]
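
A minimal LLE sketch, assuming scikit-learn and the classic swiss-roll dataset:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000)

# Each point is reconstructed as a linear combination of its nearest neighbors;
# the 2-D embedding is chosen to preserve these local reconstruction weights.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_lle = lle.fit_transform(X)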
Maximum Variance Unfolding (MVU)
 instead of defining a fixed kernel, it tries to learn the kernel using semidefinite programming
 exactly preserves all pairwise distances between nearest neighbors
 maximizes the distances between points that are not nearest neighbors
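
A sketch of the MVU semidefinite program, assuming cvxpy for the SDP and a tiny random dataset (SDP solvers do not scale to large n):

import numpy as np
import cvxpy as cp
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(20, 3)                            # tiny made-up dataset
n = X.shape[0]
G = kneighbors_graph(X, n_neighbors=4).toarray()     # nearest-neighbor graph

K = cp.Variable((n, n), PSD=True)                    # the learned kernel matrix
constraints = [cp.sum(K) == 0]                       # center the embedding
for i in range(n):
    for j in range(n):
        if G[i, j]:                                  # preserve neighbor distances exactly
            d2 = float(np.sum((X[i] - X[j]) ** 2))
            constraints.append(K[i, i] - 2 * K[i, j] + K[j, j] == d2)

# Maximizing trace(K) maximizes total variance: non-neighbor points are
# pushed apart while neighbor distances stay fixed ("unfolding" the manifold).
cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()

# Embedding: top eigenvectors of K scaled by the square roots of the eigenvalues.
eigvals, eigvecs = np.linalg.eigh(K.value)
Y = eigvecs[:, -2:] * np.sqrt(np.maximum(eigvals[-2:], 0))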