Data Mining in Bioinformatics
Day 1: Classification
Karsten Borgwardt
February 10 to February 21, 2014
Machine Learning & Computational Biology Research Group
Max Planck Institute Tübingen and
Eberhard Karls Universität Tübingen
Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Our course
Schedule
February 10 to February 21
Lecture from 10:00 to 12:30; tutorials are project-based
Oral exams on February 25 and 26 to get the certificate
Structure
1 week of algorithmics, 1 week of bioinformatics applications
Key topics: classification, clustering, feature selection,
text & graph mining
Lecture will provide introduction to the topic + discussion
of important papers
Karsten Borgwardt: Data Mining in Bioinformatics, Page 2
What is data mining?
Data Mining
Extracting knowledge from large amounts of data (Han
and Kamber, 2006)
Often used as a synonym for Knowledge Discovery, although other definitions disagree
Knowledge Discovery
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
Karsten Borgwardt: Data Mining in Bioinformatics, Page 3
What is classification?
Problem
Given an object, which class of objects does it belong
to?
Given an object x, predict its class label y.
Examples
Computer vision: Is this object a chair?
Credit cards: Is this customer to be trusted?
Marketing: Will this customer buy/like our product?
Function prediction: Is this protein an enzyme?
Gene finding: Does this sequence contain a splice site?
Karsten Borgwardt: Data Mining in Bioinformatics, Page 4
What is classification?
Setting
Classification is usually performed in a supervised setting: we are given a training dataset.
A training dataset is a dataset of pairs (xi, yi), that is, objects and their known class labels.
The test set is a dataset of test points x′ with unknown class labels.
The task is to predict the class label y′ of x′.
Role of y
if y ∈ {0, 1}: then we are dealing with a binary classification problem
if y ∈ {1, . . . , n}, (n ∈ N): a multiclass classification
problem
if y ∈ R: a regression problem
Karsten Borgwardt: Data Mining in Bioinformatics, Page 5
Classifiers in a nutshell
Nearest Neighbour
Key idea: if x1 is most similar to x2, then y1 = y2
Classification by looking at the ‘Nearest Neighbour’
Naive Bayes
A simple probabilistic classifier based on applying
Bayes’ theorem with strong (naive) independence assumptions
Decision trees
A series of decisions has to be taken to classify an object, based on its attributes
The hierarchy of these decisions is ordered as a tree, a
‘decision tree’.
Karsten Borgwardt: Data Mining in Bioinformatics, Page 6
Classifiers in a nutshell
Support Vector Machine
Key idea: Draw a line (plane, hyperplane) that separates
two classes of data
Maximise the distance between the hyperplane and the
points closest to it (margin)
Test point is predicted to belong to the class whose halfspace it is located in
Criteria for a good classifier
Accuracy
Runtime and scalability
Interpretability
Flexibility
Karsten Borgwardt: Data Mining in Bioinformatics, Page 7
Nearest Neighbour
The actual classification
Given xi, we predict its label yi by
xj = argmin_{x ∈ D} ‖x − xi‖²  ⇒  yi = yj   (1)
xi's predicted label is that of the point closest to it, that is, its 'nearest neighbour'.
Runtime
Naively, one has to compute the distance to all N points in the dataset for each test point:
O(N) for one point
O(N²) for the entire dataset
Karsten Borgwardt: Data Mining in Bioinformatics, Page 8
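As a concrete illustration of this naive approach, here is a minimal sketch of 1-nearest-neighbour classification in Python; the function name and the toy data are illustrative, not from the slides.

```python
import numpy as np

def nn_predict(X_train, y_train, x_test):
    """Predict the label of x_test as the label of its nearest training point."""
    # Compute Euclidean distances from x_test to all N training points: O(N)
    dists = np.linalg.norm(X_train - x_test, axis=1)
    j = np.argmin(dists)          # index of the nearest neighbour
    return y_train[j]

# Toy example: two 2D classes
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(nn_predict(X_train, y_train, np.array([0.95, 0.9])))  # -> 1
```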
Nearest Neighbour
How to speed NN up
Exploit the triangle inequality:
d(x1, x2) + d(x2, x3) ≥ d(x1, x3)   (2)
This holds for any metric d.
Metric
A distance function d is a metric iff
1. d(x1, x2) ≥ 0
2. d(x1, x2) = 0 if and only if x1 = x2
3. d(x1, x2) = d(x2, x1)
4. d(x1, x3) ≤ d(x1, x2) + d(x2, x3)
Karsten Borgwardt: Data Mining in Bioinformatics, Page 9
Nearest Neighbour
How to speed NN up
Rewrite the triangle inequality:
d(x1, x2) ≥ d(x1, x3) − d(x2, x3)   (3)
That means if you know d(x1, x3) and d(x2, x3), you can derive a lower bound on d(x1, x2).
If you already know a point that is closer to x1 than d(x1, x3) − d(x2, x3), then you can avoid computing d(x1, x2).
Karsten Borgwardt: Data Mining in Bioinformatics, Page 10
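A minimal sketch of this pruning idea, assuming the distances of all training points to one reference point (a pivot) have been precomputed; the names and data are illustrative.

```python
import numpy as np

def nn_with_pruning(X_train, y_train, x_test, pivot):
    """1-NN search that skips candidates using the lower bound
    d(x_test, x_j) >= |d(x_test, pivot) - d(x_j, pivot)|."""
    d_test_pivot = np.linalg.norm(x_test - pivot)
    d_train_pivot = np.linalg.norm(X_train - pivot, axis=1)  # precomputable once

    best_dist, best_idx = np.inf, -1
    for j in range(len(X_train)):
        # Lower bound from the triangle inequality: if it already exceeds
        # the best distance found so far, the exact distance is not needed.
        if abs(d_test_pivot - d_train_pivot[j]) >= best_dist:
            continue
        d = np.linalg.norm(x_test - X_train[j])
        if d < best_dist:
            best_dist, best_idx = d, j
    return y_train[best_idx]

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0]])
print(nn_with_pruning(X, np.array([0, 0, 1]), np.array([0.9, 1.1]), pivot=X[0]))
```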
Naive Bayes
Bayes’ Rule
P(C|x) = P(x|C) P(C) / P(x)   (4)
Naive Bayes Classification
Classify x into one of m classes C1, . . . , Cm
argmax_{Ci} P(Ci|x) = P(x|Ci) P(Ci) / P(x)   (5)
Karsten Borgwardt: Data Mining in Bioinformatics, Page 11
Naive Bayes
Three simplifications
P (x) is the same for all classes, ignore this term
We further assume that P (Ci) is constant for all classes
1 ≤ i ≤ m, ignore this term as well.
That means
P(Ci|x) ∝ P(x|Ci)   (6)
If x is multidimensional, that is, if x contains n features x = (x1, . . . , xn), we further assume that
P(x|Ci) = ∏_{j=1}^{n} P(xj|Ci)   (7)
Karsten Borgwardt: Data Mining in Bioinformatics, Page 12
Naive Bayes
The actual classification
The actual classification is performed by computing
P(Ci|x) ∝ ∏_{j=1}^{n} P(xj|Ci)   (8)
The three simplifications are that
all classes have the same marginal probability
all data points have the same marginal probability
all features of an object are independent of each other
Alternative name: 'Simple Bayes Classifier'
Runtime
O(Nmn), where N is the number of data points, m the number of classes, and n the number of features
Karsten Borgwardt: Data Mining in Bioinformatics, Page 13
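The following sketch implements this simplified Naive Bayes rule for categorical features by counting relative frequencies on the training data, with the uniform-prior simplification above; it is illustrative rather than a reference implementation.

```python
from collections import defaultdict

def train_naive_bayes(X, y):
    """Estimate P(x_j = value | C_i) by relative frequency per class and feature."""
    counts = defaultdict(lambda: defaultdict(int))   # (class, feature) -> value -> count
    class_sizes = defaultdict(int)
    for features, label in zip(X, y):
        class_sizes[label] += 1
        for j, value in enumerate(features):
            counts[(label, j)][value] += 1
    return counts, class_sizes

def predict_naive_bayes(counts, class_sizes, x):
    """Pick the class maximising prod_j P(x_j | C_i), ignoring priors and P(x)."""
    best_class, best_score = None, -1.0
    for c, size in class_sizes.items():
        score = 1.0
        for j, value in enumerate(x):
            score *= counts[(c, j)][value] / size   # 0 if value unseen for this class
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy example: features = (colour, shape)
X = [("red", "round"), ("red", "square"), ("green", "round"), ("green", "square")]
y = ["apple", "brick", "apple", "brick"]
model = train_naive_bayes(X, y)
print(predict_naive_bayes(*model, ("red", "round")))  # -> "apple"
```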
Decision Tree
Key idea
Recursively split the data space into regions that contain
a single class only
[Figure: toy dataset in two dimensions (axes x and Y, range −2 to 2), recursively partitioned into single-class regions]
Karsten Borgwardt: Data Mining in Bioinformatics, Page 14
Decision Tree
Concept
A decision tree is a flowchart-like tree structure with
a root: this is the uppermost node
internal nodes: these represent tests on an attribute
branches: these represent outcomes of a test
leaf nodes: these hold a class label
Karsten Borgwardt: Data Mining in Bioinformatics, Page 15
Decision Tree
Classification
given a test point x
perform the test on the attributes of x at the root
follow the branch that corresponds to the outcome of this test
repeat this procedure until you reach a leaf node
predict the label of x to be the label of that leaf node
Karsten Borgwardt: Data Mining in Bioinformatics, Page 16
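A minimal sketch of this prediction procedure, assuming the tree is stored as nested dictionaries; this representation is purely illustrative, not prescribed by the slides.

```python
def predict(tree, x):
    """Follow the branches of a decision tree until a leaf label is reached.

    An internal node is a dict {"attribute": name, "branches": {outcome: subtree}};
    a leaf is simply a class label.
    """
    node = tree
    while isinstance(node, dict):                 # still at an internal node
        outcome = x[node["attribute"]]            # perform the test on x's attribute
        node = node["branches"][outcome]          # follow the matching branch
    return node                                   # leaf node: the predicted label

# Toy tree: first test "outlook", then (for sunny days) "humidity"
tree = {"attribute": "outlook",
        "branches": {"sunny": {"attribute": "humidity",
                               "branches": {"high": "no", "normal": "yes"}},
                     "overcast": "yes"}}
print(predict(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> "yes"
```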
Decision Tree
Popularity
requires no domain knowledge
easy to interpret
construction and prediction are fast
But how to construct a decision tree?
Karsten Borgwardt: Data Mining in Bioinformatics, Page 17
Decision Tree
Construction
requires determining a splitting criterion at each internal node
this splitting criterion tells us which attribute to test at node v
we would like to use the attribute that best separates the classes on the training dataset
Karsten Borgwardt: Data Mining in Bioinformatics, Page 18
Decision Tree
Information gain
ID3 uses information gain as attribute selection measure
The information content is defined as
Info(D) = − Σ_{i=1}^{m} p(Ci) log2(p(Ci)),   (9)
where p(Ci) is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|.
This is also known as the Shannon entropy of D.
Karsten Borgwardt: Data Mining in Bioinformatics, Page 19
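A short sketch of how Info(D) can be computed from per-class counts; the helper name and example counts are illustrative, not from the slides.

```python
import math

def info(class_counts):
    """Shannon entropy of a dataset, given the number of tuples per class."""
    total = sum(class_counts)
    entropy = 0.0
    for count in class_counts:
        if count > 0:                       # 0 * log2(0) is treated as 0
            p = count / total               # p(C_i) estimated by |C_i,D| / |D|
            entropy -= p * math.log2(p)
    return entropy

print(info([9, 5]))   # ~0.940 bits for a 9-vs-5 class split
```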
Decision Tree
Information gain
Assume that attribute A was used to split D into v partitions or subsets {D1, D2, . . . , Dv}, where Dj contains those tuples in D that have outcome aj of A.
Ideally, the Dj would provide a perfect classification, but they seldom do.
How much more information do we need to arrive at an exact classification?
This is quantified by
InfoA(D) = Σ_{j=1}^{v} (|Dj| / |D|) Info(Dj).   (10)
Karsten Borgwardt: Data Mining in Bioinformatics, Page 20
Decision Tree
Information gain
The information gain is the loss of entropy (increase in
information) that is caused by splitting with respect to
attribute A
Gain(A) = Info(D) − InfoA(D)   (11)
We pick A such that this gain is maximised.
Karsten Borgwardt: Data Mining in Bioinformatics, Page 21
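The information gain of an attribute could be computed as in the following sketch, where the partitioning of D by attribute value is assumed to have been done beforehand; all names and the example counts are illustrative.

```python
import math

def entropy(counts):
    """Shannon entropy Info(D) from per-class tuple counts (eq. (9))."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_a(partitions):
    """Expected information Info_A(D): size-weighted entropy of the partitions D_j."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

def gain(class_counts, partitions):
    """Information gain Gain(A) = Info(D) - Info_A(D)."""
    return entropy(class_counts) - info_a(partitions)

# 14 tuples (9 vs 5) split by a three-valued attribute into [2,3], [4,0], [3,2]
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # -> 0.247
```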
Decision Tree
Gain ratio
The information gain is biased towards attributes with a
large number of values
For example, an ID attribute maximises the information
gain!
Hence C4.5 uses an extension of information gain: the
gain ratio
The gain ratio is based on the split information
SplitInfoA(D) = − Σ_{j=1}^{v} (|Dj| / |D|) log2(|Dj| / |D|)   (12)
and is defined as
GainRatio(A) = Gain(A) / SplitInfoA(D)   (13)
Karsten Borgwardt: Data Mining in Bioinformatics, Page 22
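A corresponding sketch for the gain ratio; it reuses the same partition representation as above, and the guard against a zero split information is an illustrative design choice, not part of the slides.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D): entropy of the partition sizes themselves (eq. (12))."""
    total = sum(partition_sizes)
    return -sum(s / total * math.log2(s / total) for s in partition_sizes if s > 0)

def gain_ratio(information_gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); guarded against zero split info."""
    si = split_info(partition_sizes)
    return information_gain / si if si > 0 else float("inf")

# Three partitions of sizes 5, 4, 5 and an information gain of 0.247
print(round(gain_ratio(0.247, [5, 4, 5]), 3))
```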
Decision Tree
Gain ratio
The attribute with maximum gain ratio is selected as the
splitting attribute
The ratio becomes unstable as the split information approaches zero
A constraint is added to ensure that the information gain
of the test selected is at least as great as the average
gain over all tests examined.
Karsten Borgwardt: Data Mining in Bioinformatics, Page 23
Decision Tree
Gini index
Attribute selection measure in CART system
The Gini index measures class impurity as
Gini(D) = 1 − Σ_{i=1}^{m} pi²,   (14)
where pi = p(Ci) as above.
If we split via attribute A into partitions {D1, D2, . . . , Dv}, the Gini index of this partitioning is defined as
GiniA(D) = Σ_{j=1}^{v} (|Dj| / |D|) Gini(Dj)   (15)
and the reduction in impurity by a split on A is
∆Gini(D) = Gini(D) − GiniA(D)   (16)
Karsten Borgwardt: Data Mining in Bioinformatics, Page 24
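A brief sketch of the Gini computations, using the same illustrative partition representation as in the earlier sketches.

```python
def gini(class_counts):
    """Gini(D) = 1 - sum_i p_i^2, computed from per-class tuple counts."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(partitions):
    """Gini_A(D): size-weighted Gini index of the partitions D_j."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

def delta_gini(class_counts, partitions):
    """Reduction in impurity achieved by splitting on A."""
    return gini(class_counts) - gini_split(partitions)

print(round(delta_gini([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))
```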
Support Vector Machines
Hyperplane classifiers
Vapnik et al. define a family of classifiers for binary classification problems.
This family is the class of hyperplanes in some dot product space H,
⟨w, x⟩ + b = 0,   (17)
where w ∈ H, b ∈ R.
These correspond to decision functions ('classifiers'):
f(x) = sgn(⟨w, x⟩ + b)   (18)
Vapnik et al. proposed a learning algorithm for determining this f from the training dataset.
Karsten Borgwardt: Data Mining in Bioinformatics, Page 25
Support Vector Machines
The optimal hyperplane
maximises the margin of separation between any training point and the hyperplane
max_{w ∈ H, b ∈ R} min{ ‖x − xi‖ | x ∈ H, ⟨w, x⟩ + b = 0, i = 1, . . . , m }   (19)
Karsten Borgwardt: Data Mining in Bioinformatics, Page 26
Support Vector Machines
Optimisation problem
minimise_{w ∈ H, b ∈ R} τ(w) = ½ ‖w‖²   (20)
subject to yi(⟨w, xi⟩ + b) ≥ 1 for all i = 1, . . . , m
Why minimise ½ ‖w‖²?
The size of the margin is 2/‖w‖. The smaller ‖w‖, the larger the margin.
Why do we have to obey the constraints yi(⟨w, xi⟩ + b) ≥ 1? They ensure that all training data points of the same class are on the same side of the hyperplane and outside the margin.
Karsten Borgwardt: Data Mining in Bioinformatics, Page 27
Support Vector Machines
The Lagrangian
We form the Lagrangian:
L(w, b, α) = ½ ‖w‖² − Σ_{i=1}^{m} αi (yi(⟨xi, w⟩ + b) − 1)   (21)
The Lagrangian is minimised with respect to the primal variables w and b, and maximised with respect to the dual variables αi.
Karsten Borgwardt: Data Mining in Bioinformatics, Page 28
Support Vector Machines
Support Vectors
At optimality,
∂L(w, b, α)/∂b = 0 and ∂L(w, b, α)/∂w = 0,   (22)
such that
Σ_{i=1}^{m} αi yi = 0 and w = Σ_{i=1}^{m} αi yi xi   (23)
Hence the solution vector w, the crucial parameter of the SVM classifier, has an expansion in terms of the training points and their labels.
Those training points with αi > 0 are the Support Vectors.
Karsten Borgwardt: Data Mining in Bioinformatics, Page 29
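A small numpy sketch of the expansion in (23) and the resulting decision function; the data, the αi values and the offset b are illustrative, since in practice the αi come from solving the optimisation problem.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0]])   # training points
y = np.array([1.0, 1.0, -1.0])                          # their labels
alpha = np.array([0.4, 0.0, 0.4])                       # dual variables (illustrative)

# w = sum_i alpha_i * y_i * x_i  -- only points with alpha_i > 0 contribute
w = (alpha * y) @ X
b = 0.0                                                  # offset, assumed known here

def decide(x):
    """f(x) = sgn(<w, x> + b)"""
    return np.sign(np.dot(w, x) + b)

print(w, decide(np.array([1.5, 0.5])))   # support vectors are rows 0 and 2
```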
Support Vector Machines
The dual problem
Plugging (23) into the Lagrangian (21), we obtain the dual optimization problem that is solved in practice:
maximise_{α ∈ R^m} W(α) = Σ_{i=1}^{m} αi − ½ Σ_{i,j=1}^{m} αi αj yi yj ⟨xi, xj⟩   (24)
subject to αi ≥ 0 for all i and the constraint Σ_{i=1}^{m} αi yi = 0 from (23).
The kernel trick
The key insight is that (24) accesses the training data only in terms of inner products ⟨xi, xj⟩.
We can plug in an inner product of our choice here! This is referred to as a kernel k:
k(xi, xj) = ⟨xi, xj⟩   (25)
Karsten Borgwardt: Data Mining in Bioinformatics, Page 30
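To make the point that (24) only touches the data through inner products, here is a small numpy sketch that evaluates the dual objective W(α) from a kernel matrix K with K[i, j] = k(xi, xj); it is purely illustrative and only evaluates the objective, it does not solve the optimisation problem.

```python
import numpy as np

def dual_objective(alpha, y, K):
    """W(alpha) = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j k(x_i, x_j)."""
    ay = alpha * y                               # elementwise alpha_i * y_i
    return alpha.sum() - 0.5 * ay @ K @ ay

# Toy data: the kernel matrix is all the objective ever sees of the inputs
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([-1.0, 1.0, 1.0])
K = X @ X.T                                      # linear kernel: k(x_i, x_j) = <x_i, x_j>
alpha = np.array([0.5, 0.3, 0.2])
print(dual_objective(alpha, y, K))
```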
Support Vector Machines
Some prominent kernels
linear kernel
k(xi, xj) = Σ_{l=1}^{n} xil xjl = xi⊤xj,   (26)
polynomial kernel
k(xi, xj) = (xi⊤xj + c)^d,   (27)
where c, d ∈ R,
Gaussian RBF kernel
k(xi, xj) = exp(−‖xi − xj‖² / (2σ²)),   (28)
where σ ∈ R.
Karsten Borgwardt: Data Mining in Bioinformatics, Page 31
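The three kernels could be written as follows; this is a minimal sketch, and the default parameter values are arbitrary examples.

```python
import numpy as np

def linear_kernel(xi, xj):
    """k(xi, xj) = <xi, xj>"""
    return float(np.dot(xi, xj))

def polynomial_kernel(xi, xj, c=1.0, d=3):
    """k(xi, xj) = (<xi, xj> + c)^d"""
    return float((np.dot(xi, xj) + c) ** d)

def gaussian_rbf_kernel(xi, xj, sigma=1.0):
    """k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))"""
    diff = xi - xj
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(a, b), polynomial_kernel(a, b), gaussian_rbf_kernel(a, b))
```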
References and further reading
References
[1] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[2] J. Han and M. Kamber. Data Mining: Concepts and
Techniques. Elsevier, Morgan-Kaufmann Publishers,
2006.
Karsten Borgwardt: Data Mining in Bioinformatics, Page 32
The end
See you tomorrow! Next topic: Clustering
Karsten Borgwardt: Data Mining in Bioinformatics, Page 33