Text Mining
• Also known as: machine learning on text data, text data mining, text analysis
• Follows the standard data mining process; part of Web mining
• Typical tasks include:
  – Text categorization (document classification)
  – Text clustering
  – Text summarization
  – Opinion mining
  – Entity/concept extraction
  – Information retrieval: search engines
  – Information extraction: question answering

Supervised learning algorithms
• Decision tree learning
• Naïve Bayes
• K-nearest neighbour
• Support Vector Machines
• Neural networks
• Genetic algorithms

Unsupervised learning
• Document clustering: HAC, k-means, BIRCH, …

Applying Machine Learning on Text
• Text representation (feature extraction):
  – Preprocessing
  – Indexing
  – Weighting model
  – Dimensionality reduction
• Similarity measure: how to compare text

Feature Extraction: Task (1)
• Task: extract a good subset of words to represent the documents
  Document collection → all unique words/phrases → feature extraction → all good words/phrases

(Some slides by Huaizhong Kou)

Feature Extraction: Task (2)
• While more and more textual information is available online, effective retrieval is difficult without good indexing of the text content.
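The feature-extraction pipeline outlined above (unique words → stop-word removal → stemming) can be sketched as follows; the stop-word list and the suffix-stripping rules are illustrative assumptions, not a standard stemmer such as Porter's:

```python
# Minimal sketch of the feature-extraction pipeline:
# tokenize -> remove stop words -> strip suffixes (naive stemming).
# Both the stop-word list and the suffix rules are toy assumptions.

STOP_WORDS = {"the", "and", "when", "more", "is", "of", "without"}
SUFFIXES = ("ing", "er", "s")  # naive stemmer, not Porter

def stem(word):
    """Strip the first matching suffix, keeping a minimal stem length."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def extract_features(documents):
    """Return the set of stemmed, non-stop-word terms across all documents."""
    terms = set()
    for doc in documents:
        for token in doc.lower().split():
            token = token.strip(".,;:!?")
            if token and token not in STOP_WORDS:
                terms.add(stem(token))
    return terms

docs = ["The walker and the walking man", "More walks when the sun is up"]
print(sorted(extract_features(docs)))  # ['man', 'sun', 'up', 'walk']
```

Note how {walker, walking, walks} all collapse to the single feature term "walk", exactly the grouping effect stemming is meant to achieve.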
• Example: the 16-word sentence "While more and more textual information is available online, effective retrieval is difficult without good indexing of text content" reduces after feature extraction to 5 index terms with counts: text (2), information (1), online (1), retrieval (1), index (1).

Feature Extraction: Preprocessing and Indexing (1)
• Identification of all unique words in the training documents (naive terms)
• Removal of stop words: non-informative words, e.g. {the, and, when, more}
• Word stemming: removal of suffixes to generate word stems, grouping related words, e.g. {walker, walking} → walk
• Term weighting: the importance of a term in a document, used to increase relevance

Feature Extraction: Indexing (2)
• The Vector Space Model (VSM) is one of the most commonly used text data models.
• Any text document is represented by a vector of terms:
  – Terms are typically words and/or phrases
  – Every term in the vocabulary becomes an independent dimension
  – Each term occurring in the document is represented by a non-zero value in the corresponding dimension
• A document collection is represented as a matrix X, where xji is the weight of the ith term in the jth document.

Feature Extraction: Weighting Model (1)
• tf (term frequency) weighting: wij = Freqij
  Freqij ::= the number of times the jth term occurs in document Di.
• Drawback: does not reflect how important a term is for discriminating between documents.
• Example: D1 = "ABRTSAQWA XAO", D2 = "RTABBAXA QSAK"

       A  B  K  O  Q  R  S  T  W  X
  D1   4  1  0  1  1  1  1  1  1  1
  D2   4  2  1  0  1  1  1  1  0  1

Feature Extraction: Weighting Model (2)
• tf-idf, simple version: wij = Freqij * log(N / DocFreqj)
  N ::= the number of documents in the training document collection.
  DocFreqj ::= the number of documents in which the jth term occurs.
• Advantage: reflects how important a term is for discriminating between documents.
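The tf-weighting example above can be reproduced with a short sketch; terms are single letters, as in the slide's toy documents:

```python
# Reproduce the tf example: D1 = "ABRTSAQWA XAO", D2 = "RTABBAXA QSAK",
# with each letter treated as a term in the vocabulary A B K O Q R S T W X.
from collections import Counter

def tf_vector(doc, vocabulary):
    """Raw term-frequency vector: wij = Freqij."""
    counts = Counter(ch for ch in doc if ch.isalpha())
    return [counts[t] for t in vocabulary]

vocab = list("ABKOQRSTWX")
d1 = tf_vector("ABRTSAQWA XAO", vocab)
d2 = tf_vector("RTABBAXA QSAK", vocab)
print(d1)  # [4, 1, 0, 1, 1, 1, 1, 1, 1, 1]
print(d2)  # [4, 2, 1, 0, 1, 1, 1, 1, 0, 1]
```

The two vectors match the rows of the tf matrix in the example: both documents give the highest weight to A, even though A occurs in every document and therefore discriminates nothing, which is exactly the drawback noted above.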
• Assumption: terms with a low DocFreq are better discriminators than terms with a high DocFreq in the document collection.
• Example (N = 2, log base 10):

       A  B  K    O    Q  R  S  T  W    X
  D1   0  0  0    0.3  0  0  0  0  0.3  0
  D2   0  0  0.3  0    0  0  0  0  0    0

Feature Extraction: Weighting Model (3)
• tf-idf weighting = TF * IDF, with tf normalized by document length, e.g. for D1 (12 terms):
  wA,1 = 4/12 * lg(2/2) = 0
  wO,1 = 1/12 * lg(2/1)

Feature Extraction: Dimension Reduction
• Document frequency thresholding
• χ2-statistic
• Latent Semantic Indexing
• Information gain
• Mutual information

Dimension Reduction: DocFreq Thresholding
• Document frequency thresholding over the training documents D:
  – Calculate DocFreq(w) for each naive term w
  – Set a threshold θ
  – Remove all words with DocFreq(w) < θ
  – The remaining words are the feature terms

Similarity Measure
• There are many different ways to measure how similar two documents are, or how similar a document is to a query.
• The result depends heavily on the choice of terms used to represent the text documents:
  – Euclidean distance (L2 norm)
  – L1 norm
  – Cosine similarity

Document Clustering: Algorithms
• k-means
• Hierarchical Agglomerative Clustering (HAC)
• BIRCH
• Association Rule Hypergraph Partitioning (ARHP)
• Categorical clustering (CACTUS, STIRR)
• STC
• QDC
• …
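The simple tf-idf formula and the cosine similarity listed above can be combined in a small sketch; log base 10 is assumed, matching the 0.3 values in the example:

```python
# Sketch: simple tf-idf, wij = Freqij * log10(N / DocFreqj), followed by
# cosine similarity, applied to the two-document example (terms A B K O Q R S T W X).
import math

def tfidf_matrix(tf_rows):
    """Apply the simple tf-idf formula to a raw term-frequency matrix."""
    n_docs = len(tf_rows)
    n_terms = len(tf_rows[0])
    doc_freq = [sum(1 for row in tf_rows if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log10(n_docs / doc_freq[j]) for j in range(n_terms)]
            for row in tf_rows]

def cosine(u, v):
    """Cosine similarity of two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

tf = [[4, 1, 0, 1, 1, 1, 1, 1, 1, 1],   # D1
      [4, 2, 1, 0, 1, 1, 1, 1, 0, 1]]   # D2
w = tfidf_matrix(tf)
print([round(x, 1) for x in w[0]])  # [0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]
print(cosine(w[0], w[1]))           # 0.0
```

Terms occurring in both documents get idf = log10(2/2) = 0, so only K, O and W survive with weight ≈ 0.3, matching the example table. Since the surviving terms of D1 (O, W) and D2 (K) do not overlap, the cosine similarity of the tf-idf vectors is 0 even though the raw tf vectors are nearly identical.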