Data Mining in Bioinformatics
Day 4: Text Mining
Karsten Borgwardt
February 18 to March 1, 2013
Machine Learning & Computational Biology Research Group
Max Planck Institute Tübingen and
Eberhard Karls Universität Tübingen
What is text mining?
Definition
Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the
(biomedical) literature.
Motivation
Most knowledge is stored in the form of text, both in industry and in academia
This alone makes text mining an integral part of knowledge discovery!
Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on the text
What is text mining?
Common tasks
Information retrieval: Find documents that are relevant to a user, or to a query, in a collection of documents
Document ranking: Rank all documents in the collection
Document selection: Classify documents into relevant and irrelevant
Information filtering: Search newly created documents for information that is relevant to a user
Document classification: Assign a document to a category that describes its content
Keyword co-occurrence: Find groups of keywords that co-occur in many documents
Evaluating text mining
Precision and Recall
Let the set of documents that are relevant to a query be denoted as {Relevant}, and the set of retrieved documents as {Retrieved}.
The precision is the percentage of retrieved documents that are relevant to the query:
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}| (1)
The recall is the percentage of relevant documents that were retrieved by the query:
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}| (2)
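As a quick illustration, both measures can be computed directly from the two document sets. A minimal sketch in Python, using made-up document IDs:

# Minimal sketch: precision and recall from document-ID sets (toy data).
relevant = {"doc1", "doc2", "doc3", "doc5"}   # {Relevant}: documents relevant to the query
retrieved = {"doc2", "doc3", "doc4"}          # {Retrieved}: documents returned by the query

hits = relevant & retrieved                   # {Relevant} ∩ {Retrieved}
precision = len(hits) / len(retrieved)        # 2/3 ≈ 0.67
recall = len(hits) / len(relevant)            # 2/4 = 0.50
print(f"precision={precision:.2f}, recall={recall:.2f}")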
Text representation
Tokenization
Process of identifying keywords in a document
Not all words in a text are relevant
Text mining ignores stop words
Stop words form the stop list
Stop lists are context-dependent
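To make this concrete, here is a minimal tokenization sketch in Python; the stop list is an illustrative assumption, precisely because stop lists are context-dependent:

# Minimal tokenization sketch: lowercase, extract word tokens, drop stop words.
import re

STOP_WORDS = {"the", "a", "of", "in", "is", "and", "to"}   # illustrative stop list

def tokenize(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())        # crude keyword recognition
    return [t for t in tokens if t not in STOP_WORDS]      # remove stop words

print(tokenize("The role of p53 in the regulation of apoptosis"))
# ['role', 'p53', 'regulation', 'apoptosis']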
Text representation
Vector space model
Given #d documents and #t terms
Model each document as a vector v in a #t-dimensional space
Weighted term-frequency matrix
Matrix TF of size #d × #t
Entries measure the association of a term and a document
If a term t does not occur in a document d, then TF(d, t) = 0
If a term t does occur in a document d, then TF(d, t) > 0
Text representation
If term t occurs in document d, there are several common choices for TF(d, t) (sketched in code below):
TF(d, t) = 1
TF(d, t) = freq(d, t)
TF(d, t) = freq(d, t) / Σ_{t'∈T} freq(d, t')
TF(d, t) = 1 + log(1 + log(freq(d, t)))
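The four schemes can be written down directly; this sketch assumes the document is already tokenized into a list of terms:

# Sketch of the four TF weighting schemes for one tokenized document.
import math
from collections import Counter

def tf_variants(doc_tokens, term):
    freq = Counter(doc_tokens)            # freq(d, t) for every term t in d
    f = freq[term]
    if f == 0:                            # term absent: TF(d, t) = 0 in all schemes
        return {"binary": 0, "raw": 0, "normalised": 0.0, "double_log": 0.0}
    return {
        "binary": 1,                                  # TF(d, t) = 1
        "raw": f,                                     # TF(d, t) = freq(d, t)
        "normalised": f / sum(freq.values()),         # freq(d, t) / Σ_{t'∈T} freq(d, t')
        "double_log": 1 + math.log(1 + math.log(f)),  # 1 + log(1 + log(freq(d, t)))
    }

print(tf_variants(["gene", "protein", "gene", "cell"], "gene"))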
Text representation
Inverse document frequency
Represents the scaling factor, or importance, of a term
A term that appears in many documents is scaled down
IDF(t) = log((1 + |d|) / |d_t|) (3)
where |d| is the number of all documents, and |d_t| is the number of documents containing term t
TF-IDF measure
Product of term frequency and inverse document frequency:
TF-IDF(d, t) = TF(d, t) · IDF(t) (4)
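Putting equations (3) and (4) together, a minimal sketch over a made-up toy corpus, using the raw frequency as TF:

# Sketch of IDF (eq. 3) and TF-IDF (eq. 4) over a toy corpus.
import math

docs = [["gene", "expression", "cancer"],
        ["gene", "mutation"],
        ["protein", "folding"]]

def idf(term, docs):
    n_t = sum(1 for d in docs if term in d)   # |d_t|: documents containing the term
    return math.log((1 + len(docs)) / n_t)    # log((1 + |d|) / |d_t|); assumes n_t > 0

def tf_idf(doc, term, docs):
    return doc.count(term) * idf(term, docs)  # TF(d, t) * IDF(t)

print(tf_idf(docs[0], "gene", docs))          # "gene" appears in 2 of 3 documents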
Measuring similarity
Cosine measure
Let v1 and v2 be two document vectors.
The cosine similarity is defined as
sim(v1, v2) = (v1ᵀ v2) / (|v1| |v2|) (5)
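Equation (5) in code, as a minimal sketch with plain Python lists as document vectors:

# Sketch of cosine similarity between two document vectors (eq. 5).
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))      # v1ᵀ v2
    norm1 = math.sqrt(sum(a * a for a in v1))     # |v1|
    norm2 = math.sqrt(sum(b * b for b in v2))     # |v2|
    return dot / (norm1 * norm2)

print(cosine([1, 2, 0, 1], [2, 1, 1, 0]))         # 4 / 6 ≈ 0.67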
Kernels
Depending on how we represent a document, there are many kernels available for measuring the similarity of these representations:
Vectorial representation: vector kernels such as the linear, polynomial, or Gaussian RBF kernel
One long string: string kernels that count common k-mers in two strings (more on that later in the course)
Keyword co-occurrence
Problem
Find sets of keywords that often co-occur
Common problem in biomedical literature: find associations between genes, proteins or other entities using
co-occurrence search
Keyword co-occurrence search is an instance of a more
general problem in data mining, called association rule
mining.
Association rules
Definitions
Let I = {I1, I2, . . . , Im} be a set of items (keywords)
Let D be the database of transactions T (collection of
documents)
A transaction T ∈ D is a set of items: T ⊆ I (a document is a set of keywords)
Let A be a set of items: A ⊆ T. An association rule is an implication of the form
A ⊆ T ⇒ B ⊆ T, (6)
where A, B ⊆ I and A ∩ B = ∅
Association rules
Support and Confidence
The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B:
support(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D}| (7)
The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B:
confidence(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D | A ⊆ T}| (8)
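Equations (7) and (8) in code; the transactions below are made-up documents viewed as keyword sets:

# Sketch of support (eq. 7) and confidence (eq. 8) over keyword-set transactions.
D = [{"gene", "cancer"}, {"gene", "protein"},
     {"gene", "cancer", "mutation"}, {"protein"}]

def support(A, B, D):
    both = sum(1 for T in D if A <= T and B <= T)   # transactions with A ⊆ T ∧ B ⊆ T
    return both / len(D)

def confidence(A, B, D):
    both = sum(1 for T in D if A <= T and B <= T)
    has_A = sum(1 for T in D if A <= T)             # transactions containing A
    return both / has_A

print(support({"gene"}, {"cancer"}, D))             # 2/4 = 0.50
print(confidence({"gene"}, {"cancer"}, D))          # 2/3 ≈ 0.67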
Association rules
Strong rules
Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules, and these are the ones we are after!
Finding strong rules
1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions)
2. Generate strong association rules from the frequent
itemsets
Association rules
Apriori algorithm
Makes use of the Apriori property: If an itemset A is
frequent, then any subset B of A (B ⊆ A) is frequent
as well. If B is infrequent, then any superset A of B
(A ⊇ B ) is infrequent as well.
Steps
1. Determine the frequent items = k-itemsets with k = 1
2. Join all pairs of frequent k-itemsets that differ in exactly one item = candidates Ck+1 for being frequent (k+1)-itemsets
3. Check the frequency of these candidates Ck+1: the frequent ones form the frequent (k+1)-itemsets (trick: immediately discard any candidate that contains an infrequent k-itemset)
4. Repeat from Step 2 until no more candidates are frequent (a code sketch follows below)
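A minimal sketch of these four steps; minsup is an absolute count here rather than a percentage, and the join is the simplified "differ in one item" version:

# Minimal Apriori sketch: level-wise generation of frequent itemsets.
from itertools import combinations

def apriori(transactions, minsup):
    # Step 1: frequent 1-itemsets
    items = {i for T in transactions for i in T}
    freq = [frozenset([i]) for i in items
            if sum(1 for T in transactions if i in T) >= minsup]
    all_frequent = list(freq)
    k = 1
    while freq:
        # Step 2: join pairs of frequent k-itemsets that differ in one item
        candidates = {a | b for a, b in combinations(freq, 2) if len(a | b) == k + 1}
        # Apriori trick: discard candidates with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Step 3: keep the candidates that are actually frequent
        freq = [c for c in candidates
                if sum(1 for T in transactions if c <= T) >= minsup]
        all_frequent.extend(freq)
        k += 1  # Step 4: repeat until no candidate survives
    return all_frequent

D = [{"gene", "cancer"}, {"gene", "cancer", "p53"}, {"gene", "protein"}]
print(apriori(D, minsup=2))   # frequent: {gene}, {cancer}, {gene, cancer}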
Transduction
Known test set
Classification on text databases often means that we
know all the data we will work with before training
Hence the test set is known a priori
This setting is called ’transductive’
Can we define classifiers that exploit the known test set?
Yes!
Transductive SVM (Joachims, ICML 1999)
Trains SVM on both training and test set
Uses test data to maximise margin
Inductive vs. transductive
Classification
Task: predict label y from features x
Classic inductive setting
Strategy: Learn classifier on (labelled) training data
Goal: Classifier shall generalise to unseen data from
same distribution
Transductive setting
Strategy: Learn classifier on (labelled) training data
AND a given (unlabelled) test dataset
Goal: Predict class labels for this particular dataset
Why transduction?
Really necessary?
Classic approach works: train on training dataset, test
on test dataset
That is what we usually do in practice, for instance, in
cross-validation.
We usually ignore or neglect the fact that these settings are transductive.
The benefits of transductive classification
Inductive setting: infinitely many potential classifiers
Transductive setting: finite number of equivalence
classes of classifiers
f and f′ in the same equivalence class ⇔ f and f′ classify the points from the training and test datasets identically
Why transduction?
Idea of Transductive SVMs
Risk on test data ≤ risk on training data + confidence interval (depends on the number of equivalence classes)
Theorem by Vapnik (1998): The larger the margin, the lower the number of equivalence classes that contain a classifier with this margin
Find hyperplane that separates classes in training data
AND in test data with maximum margin.
Why transduction?
[Figure-only slide; image not included in the transcript]
Transduction on text
[Figure-only slide; image not included in the transcript]
Transductive SVM
Linearly separable case
min_{w,b,y*} (1/2) ‖w‖²
s.t. ∀ i = 1, …, n: yi [wᵀ xi + b] ≥ 1
     ∀ j = 1, …, k: yj* [wᵀ xj* + b] ≥ 1
Transductive SVM
Non-linearly separable case
min_{w,b,y*,ξ,ξ*} (1/2) ‖w‖² + C Σ_{i=1}^{n} ξi + C* Σ_{j=1}^{k} ξj*
s.t. ∀ i = 1, …, n: yi [wᵀ xi + b] ≥ 1 − ξi
     ∀ j = 1, …, k: yj* [wᵀ xj* + b] ≥ 1 − ξj*
     ∀ i = 1, …, n: ξi ≥ 0
     ∀ j = 1, …, k: ξj* ≥ 0
Transductive SVM
Optimisation
How to solve this OP?
Not so nice: combination of integer and convex OP
Joachims’ approach: find approximate solution by iterative application of inductive SVM
train inductive SVM on training data, predict on test
data, assign labels to test data
retrain on all data, with special slack weights for the test data (C−*, C+*)
Outer loop: repeat and slowly increase (C−*, C+*)
Inner loop: within each repetition switch pairs of ’misclassified’ data points repeatedly
Local search with approximate solution to OP
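A minimal sketch of this iterative scheme, assuming scikit-learn's SVC as the inner inductive solver and per-sample weights to emulate the separate penalties; the label-switching rule and annealing schedule are simplified assumptions, not Joachims' exact SVMlight procedure, and a single C* stands in for the C−*/C+* split shown on the next slide:

# Simplified sketch of Joachims-style transductive training (assumptions noted above).
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_train, y_train, X_test, C=1.0, C_star_max=1.0):
    # Labels are assumed to be in {-1, +1}.
    # Step 1: train an inductive SVM on the labelled data, label the test set
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_train, y_train)
    y_star = clf.predict(X_test)

    C_star = 1e-3                       # small initial influence of the test points
    while C_star < C_star_max:          # outer loop: slowly increase C*
        # Retrain on training plus currently labelled test data; per-sample
        # weights emulate the different slack penalties C and C*.
        X_all = np.vstack([X_train, X_test])
        y_all = np.concatenate([y_train, y_star])
        weights = np.concatenate([np.full(len(y_train), C),
                                  np.full(len(y_star), C_star)])
        clf = SVC(kernel="linear", C=1.0)
        clf.fit(X_all, y_all, sample_weight=weights)

        # Inner loop (simplified): switch pairs of test labels that the current
        # model places on the wrong side of the hyperplane.
        margins = clf.decision_function(X_test)
        pos = np.where((y_star == 1) & (margins < 0))[0]
        neg = np.where((y_star == -1) & (margins > 0))[0]
        for i, j in zip(pos, neg):      # swap one positive with one negative label
            y_star[i], y_star[j] = -1, 1
        C_star *= 2                     # anneal the test-point penalty upwards
    return clf, y_star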
Inductive SVM for TSVM
Variant of inductive SVM
min_{w,b,y*,ξ,ξ*} (1/2) ‖w‖² + C Σ_{i=1}^{n} ξi + C−* Σ_{j: yj* = −1} ξj* + C+* Σ_{j: yj* = +1} ξj*
s.t. ∀ i = 1, …, n: yi [wᵀ xi + b] ≥ 1 − ξi
     ∀ j = 1, …, k: yj* [wᵀ xj* + b] ≥ 1 − ξj*
Three different penalty costs
C for points from the training dataset
C−* for points from the test dataset currently in class −1
C+* for points from the test dataset currently in class +1
Experiments
Average P/R-breakeven point on the Reuters dataset for
different training set sizes and a test size of 3,299
Experiments
Average P/R-breakeven point on the Reuters dataset for 17
training documents and varying test set size for the TSVM
Experiments
Average P/R-breakeven point on the WebKB category
’course’ for different training set sizes
Experiments
Average P/R-breakeven point on the WebKB category
’project’ for different training set sizes
Summary
Results
Transductive version of SVM
Maximizes margin on training and test data
Implementation uses variant of classic inductive SVM
Solution is approximate and fast
Works well on text, in particular on small training samples and large test sets
References and further reading
References
[1] T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. ICML, 1999: 200-209.
[2] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Elsevier, Morgan Kaufmann Publishers, 2006.
The end
See you tomorrow! Next topic: Graph Mining