Download Support Vector Machines for Large Scale Text Mining in R

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Support Vector Machines for
Large Scale Text Mining in R
Ingo Feinerer1
1 Vienna
Alexandros Karatzoglou2
University of Technology, Austria
2 Telefonica
Research, Spain
COMPSTAT’2010
Motivation
I
Machine learning and data mining require classification
I
Large amounts of data
I
Use R for data intensive operations
I
Text mining is especially resource hungry
I
Highly sparse matrices
I
Need of scalable implementations
Large Scale Linear Support Vector Machines
Modified Finite Newton l2 -Svm
Given
I
m binary labeled examples {xi , yi } with yi ∈ {−1, +1}, and
I
the Svm optimization problem
m
1X
λ
ci l2 (yi w T xi ) + kw k2
w = argmin
2
2
w ∈Rd
∗
i=1
the modified finite Newton l2 -Svm method gives an efficient primal
solution.
R Extension Package svmlin
Features
Implements l2 -Svm algorithm.
I
Extends original C++ version of svmlin
by Sindhwani and Keerthi (2007).
Adds support for
I
multi-class classification
(one-against-one and one-against-all voting schemes),
I
cross-validation, and
I
a broad range of sparse matrix formats
(SparseM, Matrix, slam).
R Extension Package svmlin
Interface
model <- svmlin(matrix,
labels,
lambda = 0.1,
cross = 3)
I
Regularization parameter of λ = 0.1
I
3-fold cross-validation
I
model can be used with the predict() function
R Extension Package tm
Text mining framework in R
I
Functionality for managing text documents
I
Abstracts the process of document manipulation
I
Eases the usage of heterogeneous text formats (XML, . . . )
I
Meta data management
I
Preprocessing via transformations and filters
Exports
I
(Sparse) term-document matrices
I
Interfaces to string kernels
Available via CRAN
Data
Reuters-21578
I
News articles by Reuters news agency from 1987
I
21578 short to medium length documents in XML format
I
Wide range of topics (M&A, finance, politics, . . . )
SpamAssassin
I
Public mail corpus
I
Authentic e-mail communication with classification into
normal and unsolicited mail of various difficulty levels
I
4150 ham and 1896 spam documents
20 Newsgroups
I
19997 e-mail messages taken from 20 different newsgroups
I
Wide field of topics, e.g., atheism, computer graphics, or
motorcycles
Preprocessing
Creation of term-document matrices
I
42 seconds for Reuters-21578
I
31 seconds for SpamAssassin
I
75 seconds for 20 Newsgroups
Term-document matrix size
I
Reuters-21578: 65973 terms, 21578 documents, 24 MB
I
SpamAssassin: 151029 terms, 6046 documents, 24 MB
I
20 Newsgroups: 175685 terms, 19997 documents, 46 MB
Protocol
Compare Svm implementations
I
Runtime of svm (package e1071) vs. svmlin
I
For svm we use a linear kernel and set the cost parameter to
I
Initially sample
I
Increase training data in
I
Compare classification performance using 10-fold
cross-validation
1
10
from data set for training
1
10
steps
1
λ
Results
SpamAssassin
12
SpamAssassin
8
6
4
●
●
●
2
●
●
●
e1071
svmlin
●
●
●
0
Training time in seconds
10
●
0.2
0.4
0.6
Portion of data used for training
0.8
1.0
Results
20 Newsgroups
60
20 Newsgroups
50
●
40
30
●
●
20
●
10
●
●
●
e1071
svmlin
●
●
0
Training time in seconds
●
0.2
0.4
0.6
Portion of data used for training
0.8
1.0
Results
Reuters-21578
Reuters 21578
300
●
200
●
●
100
●
●
●
●
e1071
svmlin
●
0
Training time in seconds
400
●
●
0.2
0.4
0.6
Portion of data used for training
0.8
1.0
Conclusion
I
svmlin extension package
I
Takes advantage of sparse data
I
Computations are done in primal space (no kernel necessary)
I
Comparison with state-of-the-art svm
I
Linear scaling, faster training times
Related documents