ICS 278: Data Mining
Lecture 12: Text Mining
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Data Mining Lectures    Lecture 12: Text Mining    Padhraic Smyth, UC Irvine

Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction

Text Classification
• Text classification has many applications
  – Spam email detection
  – Automated tagging of streams of news articles, e.g., Google News
  – Automated creation of Web-page taxonomies
• Data Representation
  – “Bag of words” most commonly used: either counts or binary
  – Can also use “phrases” for commonly occurring combinations of words
• Classification Methods
  – Naïve Bayes widely used (e.g., for spam email)
    • Fast and reasonably accurate
  – Support vector machines (SVMs)
    • Typically the most accurate method in research studies
    • But more complex computationally
  – Logistic Regression (regularized)
    • Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)

Trimming the Vocabulary
• Stopword removal:
  – remove “non-content” words
    • very frequent “stop words” such as “the”, “and”, ...
  – remove very rare words, e.g., that only occur a few times in 100k documents
  – Can remove 30% or more of the original unique words
• Stemming:
  – Reduce all variants of a word to a single term
  – E.g., {draw, drawing, drawings} -> “draw”
  – Porter stemming algorithm (1980)
    • relies on a preconstructed suffix list with associated rules
    • e.g., if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE
      – BINARIZATION => BINARIZE
• This still often leaves p ~ O(10^4) terms => a very high-dimensional classification problem!
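
The vocabulary-trimming steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Porter algorithm itself: the stopword list, the function names (`tokenize`, `stem_ization`, `build_vocabulary`, `bag_of_words`), and the `min_doc_freq` threshold are all hypothetical, only the single IZATION => IZE rule quoted above is implemented, and the vowel test treats just a, e, i, o, u as vowels.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "and", "a", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem_ization(word):
    """Apply only the IZATION => IZE rule, if the prefix has a vowel followed by a consonant."""
    suffix = "ization"
    if word.endswith(suffix):
        prefix = word[: -len(suffix)]
        # Simplification: only a, e, i, o, u count as vowels here.
        if re.search(r"[aeiou][^aeiou]", prefix):
            return prefix + "ize"
    return word


def build_vocabulary(docs, min_doc_freq=2):
    """Drop stopwords, stem, and drop terms that occur in too few documents."""
    doc_freq = Counter()
    for doc in docs:
        terms = {stem_ization(t) for t in tokenize(doc) if t not in STOPWORDS}
        doc_freq.update(terms)
    return sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)


def bag_of_words(doc, vocab, binary=False):
    """Map a document to a count (or binary) vector over the vocabulary."""
    counts = Counter(stem_ization(t) for t in tokenize(doc))
    vector = [counts[t] for t in vocab]
    return [int(c > 0) for c in vector] if binary else vector
```

Calling `build_vocabulary` on a training collection and then `bag_of_words` on each document gives the count or binary term vectors that the classifiers in the rest of the lecture take as input.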
Classification Issues
• Typically many features, p ~ O(10^4) terms
• Consider n sample points in p dimensions
  – Binary labels => 2^n possible labelings (or dichotomies)
  – A labeling is linearly separable if we can separate the labels with a hyperplane
  – Let f(n, p) = fraction of the 2^n possible labelings that are linearly separable

$$f(n, p) = \begin{cases} 1, & n \le p + 1 \\ \dfrac{2}{2^n} \sum_{i=0}^{p} \binom{n-1}{i}, & n > p + 1 \end{cases}$$

Classifying Term Vectors
• Typically multiple different words may be helpful in classifying a particular class, e.g.,
  – Class = “finance”
  – Words = “stocks”, “return”, “interest”, “rate”, etc.
  – Thus, classifiers that combine multiple features often do well, e.g.,
    • Naïve Bayes, logistic regression, SVMs
  – Classifiers based on single features (e.g., trees) do less well
• Linear classifiers often perform well in high dimensions
  – In many cases there are fewer documents in the training data than dimensions,
    • i.e., n < p => training data are linearly separable
  – So again, naïve Bayes, logistic regression, and linear SVMs are all useful
  – Question becomes: which linear discriminant to select?

Probabilistic “Generative” Classifiers
• Model p(x | c_k) for each class and perform classification via Bayes rule,
  c = arg max_k { p(c_k | x) } = arg max_k { p(x | c_k) p(c_k) }
• How to model p(x | c_k)?
  – p(x | c_k) = probability of a “bag of words” x given a class c_k
  – Two commonly used approaches (for text):
    • Naïve Bayes: treat each term x_j as being conditionally independent, given c_k
    • Multinomial: model a document with N words as N tosses of a p-sided die
  – Other models possible but less common,
    • E.g., model word order by using a Markov chain for p(x | c_k)

Naïve Bayes Classifier for Text
• Naïve Bayes classifier = conditional independence model
  – Assumes the terms are conditionally independent given the class:

$$p(x \mid c_k) = \prod_{j=1}^{p} p(x_j \mid c_k)$$

  – Note that we model each term x_j as a discrete random variable
  – Binary terms (Bernoulli):

$$p(x \mid c_k) = \prod_{j:\, x_j = 1} p(x_j = 1 \mid c_k) \prod_{j:\, x_j = 0} p(x_j = 0 \mid c_k)$$

  – Non-binary terms (counts), where x_j = k denotes an observed count of k (not the class index):

$$p(x \mid c_k) = \prod_{j=1}^{p} p(x_j = k \mid c_k)$$

    can use a parametric model (e.g., Poisson) or non-parametric model (e.g., histogram) for the p(x_j = k | c_k) distributions.

Multinomial Classifier for Text
• Multinomial Classification model
  – Assume that the data are generated by a p-sided die (multinomial model)

$$p(x \mid c_k) = p(N_x \mid c_k) \prod_{j=1}^{p} p(x_j \mid c_k)^{n_j}$$

  – where N_x = number of terms (total count) in document x
          n_j = number of times term j occurs in the document
  – p(N_x | c_k) = probability a document has length N_x, e.g., Poisson model
    • Can be dropped if thought not to be class dependent
  – Here we have a single random variable for each class, and the p(x_j = i | c_k) probabilities sum to 1 over i (i.e., a multinomial model)
  – Probabilities typically only defined and evaluated for i = 1, 2, 3, ...
  – But “zero counts” could also be modeled if desired
    • This would be equivalent to a Naïve Bayes model with a geometric distribution on counts

Comparing Naïve Bayes and Multinomial models
McCallum and Nigam (1998) found that multinomial outperformed naïve Bayes (with binary
features) in text classification experiments (however, this may be more a result of using counts vs. binary)

Note on names used in the literature
- Bernoulli (or multivariate Bernoulli) is sometimes used for the binary version of the Naïve Bayes model
- the multinomial model is also referred to as the “unigram” model
- the multinomial model is also sometimes (confusingly) referred to as naïve Bayes

WebKB Data Set
• Train on ~5,000 hand-labeled web pages
  – Cornell, Washington, U.Texas, Wisconsin
• Crawl and classify a new site (CMU)
• Results:

             Student   Faculty   Person   Project   Course   Departmt
  Extracted    180        66       246       99       28        1
  Correct      130        28       194       72       25        1
  Accuracy     72%       42%       79%      73%      89%      100%

[Figure: Probabilistic Model Comparison]

[Figure: Highest Probability Terms in Multinomial Distributions]

[Figure: Sample Learning Curve (Yahoo Science Data)]

Comments on Generative Models for Text
(Comments applicable to both Naïve Bayes and Multinomial classifiers)
• Simple and fast => popular in practice
  – e.g., linear in p, n, M for both training and prediction
• Training = “smoothed” frequency counts, e.g.,

$$p(c_k \mid x_j = 1) = \frac{n_{k,j} + \alpha_k}{n_k + m}$$

  – e.g., easy to use in situations where classifier needs to be updated regularly (e.g., for spam email)
• Numerical issues
  – Typically work with log p(c_k | x), etc., to avoid numerical underflow
  – Useful trick:
    • when computing Σ_j log p(x_j | c_k), for sparse data, it may be much faster to
      – precompute Σ_j log p(x_j = 0 | c_k) once per class
      – and then, only for the terms present in the document, swap each log p(x_j = 0 | c_k) for log p(x_j = 1 | c_k)
• Note: both models are “wrong”, but for classification they are often sufficient

Linear Classifiers
• Linear classifier (two-class case): w^T x + w_0 > 0
  – w is a p-dimensional vector of weights (learned from the data)
  – w_0 is a threshold (also learned from the data)
  – Equation of linear hyperplane (decision boundary): w^T x + w_0 = 0

Geometry of Linear Classifiers
[Figure: decision boundary w^T x + w_0 = 0, with the direction of the w vector marked]
• Distance of x from the boundary is (1/||w||) (w^T x + w_0)

Optimal Hyperplane and Margin
[Figure: max-margin hyperplane; M = margin, circles = support vectors]
• Goal is to find the weight vector that maximizes M
• Theory tells us that the max-margin hyperplane leads to good generalization

Optimal Separating Hyperplane
• Solution to constrained optimization problem:

$$\max_{w, w_0} M \quad \text{subject to} \quad \frac{1}{\|w\|}\, y_i (w^T x_i + w_0) \ge M, \quad i = 1, \dots, n$$

$$\min_{w, w_0} \|w\| \quad \text{subject to} \quad y_i (w^T x_i + w_0) \ge 1, \quad i = 1, \dots, n$$

  (Here y_i ∈ {-1, 1} is the binary class label for example i)
• Unique for each linearly separable data set
• Margin M of the classifier
  – the distance between the separating hyperplane and the closest training samples
  – optimal separating hyperplane <=> maximum margin

Sketch of Optimization Problem
• Define Lagrangian as a function of the w vector and the α's

$$L(w, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + w_0) - 1 \right]$$

• Form of solution dictates that optimal w can be expressed as

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

• This results in a quadratic programming optimization problem
  – Good news:
    • convex function of unknowns, unique optimum
    • Variety of well-known algorithms for finding this optimum
  – Bad news:
    • Quadratic programming in general scales as O(n^3)

Support Vector Machines
• If α_i > 0 then the distance of x_i from the separating hyperplane is M
  – Support vectors - points
with associated α_i > 0
• The decision function f(x) is computed from support vectors as

$$f(x) = \sum_{i=1}^{n} y_i \alpha_i x^T x_i$$

  => prediction can be fast
• Non-linearly-separable case: can generalize to allow “slack” constraints
• Non-linear SVMs: replace original x vector with non-linear functions of x
  – “kernel trick”: can solve high-d problem without working directly in high d
• Computational speedups: can reduce training time to O(n^2) or even near-linear
  – e.g., Platt’s SMO algorithm, Joachims’ SVMLight

[Figure: Timing results on text classification. From Chakrabarti, Chapter 5, 2002]

Classic Reuters Data Set
• 21578 documents, labeled manually
• 9603 training, 3299 test articles (“ModApte” split)
• 118 categories
  – An article can be in more than one category
  – Learn 118 binary category distinctions
• Example “interest rate” article

  2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  FRANKFURT, March 2
  The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.

Common categories (#train, #test)
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)

Dumais et al. 1998: Reuters - Accuracy

                Rocchio   NBayes   Trees   LinearSVM
  earn           92.9%    95.9%    97.8%     98.2%
  acq            64.7%    87.8%    89.7%     92.8%
  money-fx       46.7%    56.6%    66.2%     74.0%
  grain          67.5%    78.8%    85.0%     92.4%
  crude          70.1%    79.5%    85.0%     88.3%
  trade          65.1%    63.9%    72.5%     73.5%
  interest       63.4%    64.9%    67.1%     76.3%
  ship           49.2%    85.4%    74.2%     78.0%
  wheat          68.9%    69.7%    92.5%     89.7%
  corn           48.2%    65.3%    91.8%     91.1%
  Avg Top 10     64.6%    81.5%    88.4%     91.4%
  Avg All Cat    61.7%    75.2%     na       86.4%

[Figure: Precision-Recall for SVM (linear), Naïve Bayes, and NN (from Dumais 1998) using the Reuters data set]

[Figure: Comparison of accuracy across three classifiers: Naive Bayes, Maximum Entropy and Linear SVM, using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.]

Other issues in text classification
• Real-time constraints:
  – Being able to update classifiers as new data arrives
  – Being able to make predictions very quickly in real time
• Multi-labels and multiple classes
  – Text documents can have more than one label
  – SVMs, for example, natively handle only binary (two-class) labels
• Feature selection
  – Experiments have shown that feature selection (e.g., by greedy algorithms using information gain) can improve results

Further Reading on Text Classification
• General references on text and language modeling
  – Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999.
  – Speech and Language Processing: An Introduction to Natural Language Processing, Dan Jurafsky and James Martin, Prentice Hall, 2000.
• SVMs for text classification
  – T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.
• Web-related text mining
  – S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
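
Worked example: the numerical comments on generative models earlier in the lecture (work in log space; precompute the "all terms absent" sum for sparse data) can be sketched as follows for the Bernoulli naïve Bayes model. This is a minimal illustration under stated assumptions: the function names, the dictionary layout of the model, and the add-`alpha` smoothing with denominator `n_c + 2 * alpha` are choices made here, not the lecture's exact estimator.

```python
import math


def train_bernoulli_nb(docs, labels, vocab_size, alpha=1.0):
    """docs: list of sets of term indices present in each document."""
    model = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n_c = len(class_docs)
        log_p1, log_p0 = [], []
        for j in range(vocab_size):
            n_cj = sum(1 for d in class_docs if j in d)
            # Smoothed frequency count (add-alpha smoothing is an assumption here).
            p1 = (n_cj + alpha) / (n_c + 2 * alpha)
            log_p1.append(math.log(p1))
            log_p0.append(math.log(1.0 - p1))
        model[c] = {
            "log_prior": math.log(n_c / len(docs)),
            "log_p1": log_p1,
            "log_p0": log_p0,
            "sum_log_p0": sum(log_p0),  # precomputed once per class
        }
    return model


def log_score_dense(model, c, present):
    """Sum the log terms over every vocabulary entry: O(p) per document."""
    m = model[c]
    total = m["log_prior"]
    for j in range(len(m["log_p1"])):
        total += m["log_p1"][j] if j in present else m["log_p0"][j]
    return total


def log_score_sparse(model, c, present):
    """Start from the precomputed sum and correct only the terms that are present."""
    m = model[c]
    total = m["log_prior"] + m["sum_log_p0"]
    for j in present:
        total += m["log_p1"][j] - m["log_p0"][j]
    return total


def classify(model, present):
    """Return the class with the highest log score for a set of present terms."""
    return max(model, key=lambda c: log_score_sparse(model, c, present))
```

The sparse scorer touches only the terms that occur in the document, so its cost depends on document length rather than on the vocabulary size p; both scorers return the same value up to floating-point rounding.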