ICS 278: Data Mining
Lecture 12: Text Mining
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine
Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
Text Classification
• Text classification has many applications
  – Spam email detection
  – Automated tagging of streams of news articles, e.g., Google News
  – Automated creation of Web-page taxonomies
• Data representation
  – "Bag of words" most commonly used: either counts or binary (see the sketch after this list)
  – Can also use "phrases" for commonly occurring combinations of words
• Classification methods
  – Naïve Bayes widely used (e.g., for spam email)
    • Fast and reasonably accurate
  – Support vector machines (SVMs)
    • Typically the most accurate method in research studies
    • But more complex computationally
  – Logistic regression (regularized)
    • Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)
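To make the "bag of words" representation above concrete, here is a minimal Python sketch (added to these notes; the vocabulary, documents, and function name are invented for illustration):

```python
from collections import Counter

def bag_of_words(doc, vocabulary, binary=False):
    """Map a tokenized document to a term vector over a fixed vocabulary.

    Returns term counts by default, or 0/1 indicators if binary=True.
    """
    counts = Counter(doc)
    if binary:
        return [1 if counts[term] > 0 else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]

# Toy example (vocabulary and document are hypothetical)
vocab = ["stocks", "interest", "rate", "goal", "match"]
doc = "interest rate rise lifts stocks as interest grows".split()

print(bag_of_words(doc, vocab))               # [1, 2, 1, 0, 0]
print(bag_of_words(doc, vocab, binary=True))  # [1, 1, 1, 0, 0]
```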
Trimming the Vocabulary
• Stopword removal:
  – remove "non-content" words
    • very frequent "stop words" such as "the", "and", ...
  – remove very rare words, e.g., words that occur only a few times in 100k documents
  – can remove 30% or more of the original unique words
• Stemming:
  – reduce all variants of a word to a single term
  – e.g., {draw, drawing, drawings} -> "draw"
  – Porter stemming algorithm (1980)
    • relies on a preconstructed suffix list with associated rules
    • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
      – BINARIZATION => BINARIZE
• This still often leaves p ~ O(10^4) terms
  => a very high-dimensional classification problem!
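As a rough illustration of these trimming steps, the sketch below (added here) combines a tiny stopword list with a single Porter-style suffix rule; the real Porter (1980) algorithm uses a much larger, carefully ordered rule set, and the stopword list shown is illustrative only:

```python
# Toy vocabulary trimming: stopword removal plus one crude suffix rule
# in the spirit of Porter stemming (illustrative, not the full algorithm).

STOPWORDS = {"the", "and", "a", "of", "to", "in"}   # tiny illustrative list
VOWELS = set("aeiou")

def has_vowel_consonant(stem):
    """True if the stem contains a vowel followed by a consonant."""
    return any(stem[i] in VOWELS and stem[i + 1] not in VOWELS
               for i in range(len(stem) - 1))

def crude_stem(word):
    # Example rule from the slide: ...IZATION -> ...IZE, provided the
    # remaining prefix contains at least one vowel followed by a consonant.
    if word.endswith("ization") and has_vowel_consonant(word[:-7]):
        return word[:-7] + "ize"
    return word

def trim(tokens):
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(trim("the binarization of the term counts".split()))
# ['binarize', 'term', 'counts']
```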
Classification Issues
• Typically many features, p ~ O(10^4) terms
• Consider n sample points in p dimensions
  – Binary labels => 2^n possible labelings (or dichotomies)
  – A labeling is linearly separable if we can separate the labels with a hyperplane
  – Let f(n, p) = fraction of the 2^n possible labelings that are linearly separable:

$$
f(n, p) =
\begin{cases}
1, & n \le p + 1 \\[4pt]
\dfrac{2}{2^{n}} \displaystyle\sum_{i=0}^{p} \binom{n-1}{i}, & n > p + 1
\end{cases}
$$
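The counting result above can be evaluated directly. Below is a small Python sketch (added here, not from the slides; the function name is illustrative):

```python
from math import comb

def fraction_linearly_separable(n, p):
    """Fraction of the 2^n labelings of n points in general position in
    p dimensions that are linearly separable (the f(n, p) above)."""
    if n <= p + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(p + 1)) / 2 ** n

# With n = 100 documents and p = 10,000 terms, every labeling is separable:
print(fraction_linearly_separable(100, 10_000))   # 1.0
# With many more points than dimensions, almost none are:
print(fraction_linearly_separable(100, 5))        # ~1.2e-22
```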
Classifying Term Vectors
• Typically multiple different words may be helpful in classifying a particular class, e.g.,
  – Class = "finance"
  – Words = "stocks", "return", "interest", "rate", etc.
  – Thus, classifiers that combine multiple features often do well, e.g.,
    • naïve Bayes, logistic regression, SVMs
  – Classifiers based on single features (e.g., trees) do less well
• Linear classifiers often perform well in high dimensions
  – In many cases there are fewer documents in the training data than dimensions, i.e., n < p
    => the training data are then linearly separable
  – So again, naïve Bayes, logistic regression, and linear SVMs are all useful
  – The question becomes: which linear discriminant to select?
Probabilistic “Generative” Classifiers
• Model p( x | ck ) for each class and perform classification via Bayes rule,
$$\hat{c} = \arg\max_{c_k} \{\, p(c_k \mid x) \,\} = \arg\max_{c_k} \{\, p(x \mid c_k)\, p(c_k) \,\}$$
• How to model p( x | ck )?
– p( x | ck ) = probability of a “bag of words” x given a class ck
– Two commonly used approaches (for text):
• Naïve Bayes: treat each term xj as being conditionally independent, given ck
• Multinomial: model a document with N words as N tosses of a p-sided die
– Other models possible but less common,
• E.g., model word order by using a Markov chain for p( x | ck )
Naïve Bayes Classifier for Text
• Naïve Bayes classifier = conditional independence model
– Assumes conditional independence of the terms given the class:

$$p(x \mid c_k) = \prod_{j} p(x_j \mid c_k)$$

– Note that we model each term xj as a discrete random variable
– Binary terms (Bernoulli):

$$p(x \mid c_k) = \prod_{j:\, x_j = 1} p(x_j = 1 \mid c_k) \prod_{j:\, x_j = 0} p(x_j = 0 \mid c_k)$$

– Non-binary terms (counts):

$$p(x \mid c_k) = \prod_{j} p(x_j = n_j \mid c_k)$$

where nj is the observed count of term j; can use a parametric model (e.g., Poisson) or a non-parametric model (e.g., histogram) for these distributions.
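A minimal sketch of the binary (Bernoulli) model above, using smoothed frequency counts for training (added here; the data, function names, and smoothing parameter alpha are illustrative):

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate smoothed Bernoulli parameters p(x_j = 1 | c_k) and priors p(c_k).

    X: list of binary term vectors, y: list of class labels.
    """
    classes = sorted(set(y))
    p = len(X[0])
    params, priors = {}, {}
    for c in classes:
        docs = [x for x, label in zip(X, y) if label == c]
        n_c = len(docs)
        priors[c] = n_c / len(X)
        # smoothed frequency counts (Laplace smoothing with parameter alpha)
        params[c] = [(sum(x[j] for x in docs) + alpha) / (n_c + 2 * alpha)
                     for j in range(p)]
    return classes, priors, params

def log_posterior(x, c, priors, params):
    """log p(c) + sum_j log p(x_j | c) under conditional independence."""
    theta = params[c]
    return math.log(priors[c]) + sum(
        math.log(theta[j]) if x[j] else math.log(1.0 - theta[j])
        for j in range(len(x)))

def predict(x, classes, priors, params):
    return max(classes, key=lambda c: log_posterior(x, c, priors, params))

# Toy binary term vectors and labels (invented)
X = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
y = ["finance", "finance", "sports", "sports"]
classes, priors, params = train_bernoulli_nb(X, y)
print(predict([1, 1, 1, 0], classes, priors, params))   # -> 'finance'
```

Working in log probabilities here also avoids the numerical underflow issue discussed later in the lecture.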
Multinomial Classifier for Text
• Multinomial classification model
  – Assume that the data are generated by a p-sided die (multinomial model):

$$p(x \mid c_k) = p(N_x \mid c_k) \prod_{j=1}^{p} p(x_j \mid c_k)^{\,n_j}$$
– where Nx = number of terms (total count) in document x
nj = number of times term j occurs in the document
– p(Nx| ck) = probability a document has length Nx, e.g., Poisson model
• Can be dropped if thought not to be class dependent
– Here we have a single random variable for each class, and the p( xj = i | ck )
probabilities sum to 1 over i (i.e., a multinomial model)
– Probabilities typically only defined and evaluated for i=1, 2, 3…
– But “zero counts” could also be modeled if desired
• This would be equivalent to a Naïve Bayes model with a geometric distribution on
counts
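For comparison, a minimal sketch of the multinomial class-conditional model (added here, with illustrative names), dropping the p(Nx | ck) document-length term as suggested above:

```python
import math
from collections import Counter

def train_multinomial(docs_by_class, vocabulary, alpha=1.0):
    """Estimate smoothed multinomial term probabilities p(term | c_k).

    docs_by_class: dict mapping class -> list of tokenized documents.
    """
    params = {}
    for c, docs in docs_by_class.items():
        counts = Counter(t for doc in docs for t in doc)
        total = sum(counts[t] for t in vocabulary)
        # Dirichlet/Laplace smoothing so no term gets probability zero
        params[c] = {t: (counts[t] + alpha) / (total + alpha * len(vocabulary))
                     for t in vocabulary}
    return params

def multinomial_log_likelihood(doc, theta):
    """sum_j n_j * log p(term_j | c_k); the p(N_x | c_k) factor is dropped."""
    n = Counter(doc)
    return sum(n_j * math.log(theta[t]) for t, n_j in n.items() if t in theta)
```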
Comparing Naïve Bayes and Multinomial models
McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes (with binary features) in text classification experiments (however, this may be more a result of using counts vs. binary features).

Note on names used in the literature:
- "Bernoulli" (or "multivariate Bernoulli") is sometimes used for the binary version of the naïve Bayes model
- the multinomial model is also referred to as the "unigram" model
- the multinomial model is also sometimes (confusingly) referred to as naïve Bayes
WebKB Data Set
• Train on ~5,000 hand-labeled web pages
– Cornell, Washington, U.Texas, Wisconsin
• Crawl and classify a new site (CMU)
• Results:
             Student   Faculty   Person   Project   Course   Department
Extracted        180        66      246        99       28            1
Correct          130        28      194        72       25            1
Accuracy         72%       42%      79%       73%      89%         100%
Probabilistic Model Comparison
Highest Probability Terms in Multinomial Distributions
Sample Learning Curve
(Yahoo Science Data)
Comments on Generative Models for Text
(Comments applicable to both Naïve Bayes and Multinomial classifiers)
• Simple and fast => popular in practice
  – e.g., linear in p, n, M for both training and prediction
  – Training = "smoothed" frequency counts, e.g.,

$$p(c_k \mid x_j = 1) \approx \frac{n_{k,j} + \alpha_k}{n_k + \alpha m}$$

  – e.g., easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email)
• Numerical issues
  – Typically work with log p( ck | x ), etc., to avoid numerical underflow
  – Useful trick for sparse data: when computing Σj log p( xj | ck ), it can be much faster to
    • precompute Σj log p( xj = 0 | ck ) once per class
    • then, for each term that is present in the document, add log p( xj = 1 | ck ) and subtract the corresponding log p( xj = 0 | ck ) term
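A small sketch of this trick (added here; theta[j] stands for p(xj = 1 | ck), and the names are illustrative):

```python
import math

def precompute_absent_sum(theta):
    """Precompute sum_j log p(x_j = 0 | c_k) once per class."""
    return sum(math.log(1.0 - p1) for p1 in theta)

def sparse_log_likelihood(present_terms, theta, absent_sum):
    """Adjust the precomputed sum only for the few terms present in the document."""
    total = absent_sum
    for j in present_terms:              # indices of terms with x_j = 1
        total += math.log(theta[j]) - math.log(1.0 - theta[j])
    return total
```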
• Note: both models are "wrong", but for classification purposes they are often sufficient
Linear Classifiers
• Linear classifier (two-class case): predict the positive class when

$$w^T x + w_0 > 0$$

  – w is a p-dimensional vector of weights (learned from the data)
  – w0 is a threshold (also learned from the data)
  – Equation of the linear hyperplane (decision boundary):

$$w^T x + w_0 = 0$$
Geometry of Linear Classifiers
[Figure: a separating hyperplane w^T x + w_0 = 0, with the weight vector w perpendicular to the boundary; the signed distance of a point x from the boundary is (1/||w||) (w^T x + w_0).]
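The distance expression above can be computed directly; a short sketch (added here, assuming NumPy, with made-up numbers):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance of point x from the hyperplane w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w = np.array([3.0, 4.0])    # ||w|| = 5
print(signed_distance(np.array([1.0, 1.0]), w, -2.0))   # (3 + 4 - 2) / 5 = 1.0
```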
Optimal Hyperplane and Margin
[Figure: separating hyperplane with margin M; the circled points are the support vectors.]
• Goal is to find the weight vector that maximizes M
• Theory tells us that the max-margin hyperplane leads to good generalization
Optimal Separating Hyperplane
• Solution to constrained optimization problem:
$$\max_{w,\, w_0} \; M \quad \text{subject to} \quad \frac{1}{\|w\|}\, y_i \left( w^T x_i + w_0 \right) \ge M, \quad i = 1, \dots, n$$

which (after rescaling) is equivalent to

$$\min_{w,\, w_0} \; \|w\| \quad \text{subject to} \quad y_i \left( w^T x_i + w_0 \right) \ge 1, \quad i = 1, \dots, n$$

(Here yi ∈ {-1, +1} is the binary class label for example i.)
• Unique for each linearly separable data set
• Margin M of the classifier
  – the distance between the separating hyperplane and the closest training samples
  – optimal separating hyperplane ⇔ maximum margin
Sketch of Optimization Problem
• Define the Lagrangian as a function of the w vector and the multipliers αi:

$$L(w, \alpha) = \frac{1}{2}\,\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[\, y_i \left( w^T x_i + w_0 \right) - 1 \,\right]$$

• The form of the solution dictates that the optimal w can be expressed as

$$w = \sum_{i=1}^{n} \alpha_i\, y_i\, x_i$$
• This results in a quadratic programming optimization problem
– Good news:
• convex function of unknowns, unique optimum
• Variety of well-known algorithms for finding this optimum
– Bad news:
• Quadratic programming in general scales as O(n^3)
Support Vector Machines
• If i > 0 then the distance of xi from the separating hyperplane is M
– Support vectors - points with associated I > 0
• The decision function f(x) is computed from support vectors as
n
f ( x)   yi i xT xi
=> prediction can be fast
i 1
• Non-linearly-separable case: can generalize to allow "slack" constraints
• Non-linear SVMs: replace the original x vector with non-linear functions of x
  – "kernel trick": can solve the high-dimensional problem without working directly in high dimensions
• Computational speedups: can reduce training time to O(n^2) or even near-linear
  – e.g., Platt's SMO algorithm, Joachims' SVMlight
[Figure: timing results on text classification, from Chakrabarti, Chapter 5, 2002.]
Classic Reuters Data Set
• 21578 documents, labeled manually
• 9603 training, 3299 test articles ("ModApte" split)
• 118 categories
  – An article can be in more than one category
  – Learn 118 binary category distinctions
• Example "interest rate" article:
  2-APR-1987 06:35:19.50
  west-germany
  b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  FRANKFURT, March 2
  The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
• Common categories (#train, #test):
  – Earn (2877, 1087)
  – Acquisitions (1650, 179)
  – Money-fx (538, 179)
  – Grain (433, 149)
  – Crude (389, 189)
  – Trade (369, 119)
  – Interest (347, 131)
  – Ship (197, 89)
  – Wheat (212, 71)
  – Corn (182, 56)
Dumais et al. 1998: Reuters - Accuracy
              Rocchio   NBayes   Trees   LinearSVM
earn            92.9%    95.9%   97.8%       98.2%
acq             64.7%    87.8%   89.7%       92.8%
money-fx        46.7%    56.6%   66.2%       74.0%
grain           67.5%    78.8%   85.0%       92.4%
crude           70.1%    79.5%   85.0%       88.3%
trade           65.1%    63.9%   72.5%       73.5%
interest        63.4%    64.9%   67.1%       76.3%
ship            49.2%    85.4%   74.2%       78.0%
wheat           68.9%    69.7%   92.5%       89.7%
corn            48.2%    65.3%   91.8%       91.1%
Avg Top 10      64.6%    81.5%   88.4%       91.4%
Avg All Cat     61.7%    75.2%      na       86.4%
[Figure: precision-recall curves for SVM (linear), Naïve Bayes, and NN, using the Reuters data set (from Dumais 1998).]
[Figure: comparison of accuracy across three classifiers (Naive Bayes, Maximum Entropy, and Linear SVM) on three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.]
Other issues in text classification
• Real-time constraints:
– Being able to update classifiers as new data arrives
– Being able to make predictions very quickly in real-time
• Multi-labels and multiple classes
  – Text documents can have more than one label
  – SVMs, for example, are inherently binary (two-class) classifiers, so multi-label problems are typically handled by training one binary classifier per label (see the sketch below)
• Feature selection
– Experiments have shown that feature selection (e.g., by greedy
algorithms using information gain) can improve results
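As a sketch of the one-binary-classifier-per-label strategy mentioned above (added here, assuming scikit-learn; the documents and label sets are invented), OneVsRestClassifier trains one LinearSVC per category, echoing the 118 binary distinctions used for the Reuters data. With so little toy data the predicted label set is only meant to show the mechanics:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = ["wheat and corn exports rise",
        "grain shipments delayed at port",
        "bundesbank discount rate unchanged",
        "interest rates and money supply figures"]
label_sets = [{"wheat", "corn", "grain"}, {"grain", "ship"},
              {"interest", "money-fx"}, {"interest", "money-fx"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)           # one 0/1 column per category
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)   # one linear SVM per label
pred = clf.predict(vec.transform(["corn and wheat prices"]))
print(mlb.inverse_transform(pred))          # predicted label set for the new document
```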
Further Reading on Text Classification
• General references on text and language modeling
– Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999.
– Speech and Language Processing: An Introduction to Natural Language
Processing, Dan Jurafsky and James Martin, Prentice Hall, 2000.
• SVMs for text classification
– T. Joachims, Learning to Classify Text using Support Vector Machines:
Methods, Theory and Algorithms, Kluwer, 2002
• Web-related text mining
– S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext
Data, Morgan Kaufmann, 2003.