Data Mining in Bioinformatics
Day 1: Classification

Karsten Borgwardt
February 10 to February 21, 2014
Machine Learning & Computational Biology Research Group
Max Planck Institute Tübingen and Eberhard Karls Universität Tübingen

Our course

Schedule
- February 10 to February 21
- Lecture from 10:00 to 12:30; tutorials are project-based
- Oral exams on February 25 and 26 to obtain the certificate

Structure
- 1 week of algorithmics, 1 week of bioinformatics applications
- Key topics: classification, clustering, feature selection, text & graph mining
- The lecture provides an introduction to each topic plus a discussion of important papers

What is data mining?

Data Mining
- Extracting knowledge from large amounts of data (Han and Kamber, 2006)
- Often used as a synonym for Knowledge Discovery, although other definitions disagree

Knowledge Discovery
- Data cleaning
- Data integration
- Data selection
- Data transformation
- Data mining
- Pattern evaluation
- Knowledge presentation

What is classification?

Problem
- Given an object, which class of objects does it belong to?
- Given object x, predict its class label y.

Examples
- Computer vision: Is this object a chair?
- Credit cards: Is this customer to be trusted?
- Marketing: Will this customer buy/like our product?
- Function prediction: Is this protein an enzyme?
- Gene finding: Does this sequence contain a splice site?

Setting
- Classification is usually performed in a supervised setting: we are given a training dataset.
- A training dataset is a dataset of pairs (x_i, y_i), that is, objects and their known class labels.
- The test set is a dataset of test points x' with unknown class labels.
- The task is to predict the class label y' of x'.

Role of y
- If y ∈ {0, 1}, we are dealing with a binary classification problem.
- If y ∈ {1, ..., n} (n ∈ ℕ), a multiclass classification problem.
- If y ∈ ℝ, a regression problem.

Classifiers in a nutshell

Nearest Neighbour
- Key idea: if x_1 is most similar to x_2, then y_1 = y_2.
- Classification by looking at the 'nearest neighbour'.

Naive Bayes
- A simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions.

Decision trees
- A series of decisions is taken to classify an object, based on its attributes.
- The hierarchy of these decisions is ordered as a tree, a 'decision tree'.

Support Vector Machine
- Key idea: draw a line (plane, hyperplane) that separates two classes of data.
- Maximise the distance between the hyperplane and the points closest to it (the margin).
- A test point is predicted to belong to the class whose halfspace it lies in.

Criteria for a good classifier
- Accuracy
- Runtime and scalability
- Interpretability
- Flexibility

Nearest Neighbour

The actual classification
- Given x_i, we predict its label y_i by

\[ x_j = \operatorname*{argmin}_{x \in D} \lVert x - x_i \rVert^2 \;\Rightarrow\; y_i = y_j \quad (1) \]

- x_i's predicted label is that of the point closest to it, that is, its 'nearest neighbour'.

Runtime
- Naively, one has to compute the distance to all N points in the dataset for each query: O(N) for one point, O(N^2) for the entire dataset.

How to speed NN up
- Exploit the triangle inequality:

\[ d(x_1, x_2) + d(x_2, x_3) \geq d(x_1, x_3) \quad (2) \]

- This holds for any metric d.

Metric
A distance function d is a metric iff
1. d(x_1, x_2) ≥ 0
2. d(x_1, x_2) = 0 if and only if x_1 = x_2
3. d(x_1, x_2) = d(x_2, x_1)
4. d(x_1, x_3) ≤ d(x_1, x_2) + d(x_2, x_3)

- Rewrite the triangle inequality:

\[ d(x_1, x_2) \geq d(x_1, x_3) - d(x_2, x_3) \quad (3) \]

- This means that if you know d(x_1, x_3) and d(x_2, x_3), you can compute a lower bound on d(x_1, x_2).
- If you already know a point that is closer to x_1 than d(x_1, x_3) − d(x_2, x_3), you can avoid computing d(x_1, x_2) altogether.
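To make rules (1) and (3) concrete, here is a minimal Python sketch (my own illustration, not from the slides) of brute-force 1-NN and a variant that prunes candidates via the lower bound from equation (3), using a single precomputed pivot point. The data, the pivot choice, and all function names are assumptions; plain Euclidean distance is used, since it is a metric and the triangle inequality applies.

```python
import numpy as np

def nn_predict(X_train, y_train, q):
    """Brute-force 1-NN (equation (1)): O(N) distance computations per query."""
    dists = np.linalg.norm(X_train - q, axis=1)
    return y_train[np.argmin(dists)]

def nn_predict_pruned(X_train, y_train, q, pivot, pivot_dists):
    """1-NN with triangle-inequality pruning through one pivot point p.

    By equation (3), |d(q, p) - d(x, p)| is a lower bound on d(q, x), so any
    candidate whose bound is not below the best distance found so far can be
    skipped without computing its true distance to q.
    """
    d_qp = np.linalg.norm(q - pivot)
    best_i, best_d = 0, np.inf
    for i, x in enumerate(X_train):
        if abs(d_qp - pivot_dists[i]) >= best_d:
            continue  # lower bound already rules this candidate out
        d = np.linalg.norm(q - x)
        if d < best_d:
            best_i, best_d = i, d
    return y_train[best_i]

# Toy usage with made-up data
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
pivot = X[0]
pivot_dists = np.linalg.norm(X - pivot, axis=1)  # computed once, reused per query
q = np.array([0.95, 0.9])
print(nn_predict(X, y, q), nn_predict_pruned(X, y, q, pivot, pivot_dists))  # 1 1
```

Real implementations push this idea further with many pivots or metric trees, but the single-pivot sketch already shows where the saved distance computations come from.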
Naive Bayes

Bayes' rule

\[ P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)} \quad (4) \]

Naive Bayes classification
- Classify x into one of m classes C_1, ..., C_m:

\[ \operatorname*{argmax}_{C_i} P(C_i \mid x) = \operatorname*{argmax}_{C_i} \frac{P(x \mid C_i)\, P(C_i)}{P(x)} \quad (5) \]

Three simplifications
- P(x) is the same for all classes, so we ignore this term.
- We further assume that P(C_i) is constant for all classes 1 ≤ i ≤ m, and ignore this term as well. That means

\[ P(C_i \mid x) \propto P(x \mid C_i) \quad (6) \]

- If x is multidimensional, that is, if x contains n features x = (x_1, ..., x_n), we further assume that

\[ P(x \mid C_i) = \prod_{j=1}^{n} P(x_j \mid C_i) \quad (7) \]

The actual classification
- Classification is performed by computing

\[ P(C_i \mid x) \propto \prod_{j=1}^{n} P(x_j \mid C_i) \quad (8) \]

and predicting the class that maximises this product.
- The three simplifications are that all classes have the same marginal probability, all data points have the same marginal probability, and all features of an object are independent of each other.
- Alternative name: 'Simple Bayes Classifier'.

Runtime
- O(Nmn), where N is the number of data points, m the number of classes, and n the number of features.
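A minimal sketch of classification rule (8) for categorical features, assuming the uniform class priors of simplification (6); the helper names nb_fit and nb_predict and the toy data are illustrative, not from the slides. The conditional probabilities P(x_j | C_i) are estimated by relative frequencies on the training set.

```python
import numpy as np
from collections import defaultdict

def nb_fit(X, y):
    """Estimate P(x_j = v | C_i) by relative frequencies per class and feature."""
    classes = np.unique(y)
    cond = defaultdict(dict)  # cond[(class, feature_index)][value] = probability
    for c in classes:
        Xc = X[y == c]
        for j in range(X.shape[1]):
            values, counts = np.unique(Xc[:, j], return_counts=True)
            for v, n in zip(values, counts):
                cond[(c, j)][v] = n / len(Xc)
    return classes, cond

def nb_predict(classes, cond, x):
    """Equation (8): argmax over classes of prod_j P(x_j | C_i), uniform priors."""
    best_c, best_p = None, -1.0
    for c in classes:
        p = 1.0
        for j, v in enumerate(x):
            p *= cond[(c, j)].get(v, 0.0)  # unseen feature value: probability 0
        if p > best_p:
            best_c, best_p = c, p
    return best_c

# Toy usage with made-up categorical data
X = np.array([["sunny", "hot"], ["rainy", "mild"],
              ["sunny", "mild"], ["rainy", "hot"]])
y = np.array(["no", "yes", "yes", "no"])
classes, cond = nb_fit(X, y)
print(nb_predict(classes, cond, ["sunny", "hot"]))  # -> "no"
```

Note that an unseen feature value zeroes out the whole product; in practice one would smooth the frequency estimates (e.g. Laplace smoothing) to avoid this.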
Decision Tree

Key idea
- Recursively split the data space into regions that contain a single class only.

[Figure: a two-dimensional toy dataset (axes x and y) recursively partitioned into single-class regions]

Concept
A decision tree is a flowchart-like tree structure with
- a root: the uppermost node,
- internal nodes: these represent tests on an attribute,
- branches: these represent the outcomes of a test,
- leaf nodes: these hold a class label.

Classification
- Given a test point x, perform the test on the attributes of x at the root.
- Follow the branch that corresponds to the outcome of this test.
- Repeat this procedure until you reach a leaf node.
- Predict the label of x to be the label of that leaf node.

Popularity
- Requires no domain knowledge
- Easy to interpret
- Construction and prediction are fast
But how to construct a decision tree?

Construction
- Construction requires determining a splitting criterion at each internal node.
- This splitting criterion tells us which attribute to test at node v.
- We would like to use the attribute that best separates the classes on the training dataset.

Information gain
- ID3 uses information gain as its attribute selection measure.
- The information content is defined as

\[ \mathrm{Info}(D) = -\sum_{i=1}^{m} p(C_i) \log_2 p(C_i), \quad (9) \]

where p(C_i) is the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
- This is also known as the Shannon entropy of D.
- Assume that attribute A was used to split D into v partitions or subsets {D_1, D_2, ..., D_v}, where D_j contains those tuples in D that have outcome a_j of A.
- Ideally, the D_j would provide a perfect classification, but they seldom do.
- How much more information do we need to arrive at an exact classification? This is quantified by

\[ \mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, \mathrm{Info}(D_j). \quad (10) \]

- The information gain is the loss of entropy (increase in information) caused by splitting with respect to attribute A:

\[ \mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \quad (11) \]

- We pick the A that maximises this gain.

Gain ratio
- Information gain is biased towards attributes with a large number of values.
- For example, an ID attribute maximises the information gain!
- Hence C4.5 uses an extension of information gain: the gain ratio.
- The gain ratio is based on the split information

\[ \mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right) \quad (12) \]

and is defined as

\[ \mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}. \quad (13) \]

- The attribute with maximum gain ratio is selected as the splitting attribute.
- The ratio becomes unstable as the split information approaches zero.
- A constraint is added to ensure that the information gain of the selected test is at least as great as the average gain over all tests examined.

Gini index
- The attribute selection measure in the CART system.
- The Gini index measures class impurity as

\[ \mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2 \quad (14) \]

- If we split via attribute A into partitions {D_1, D_2, ..., D_v}, the Gini index of this partitioning is defined as

\[ \mathrm{Gini}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, \mathrm{Gini}(D_j) \quad (15) \]

and the reduction in impurity achieved by a split on A is

\[ \Delta \mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D) \quad (16) \]
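The attribute selection measures (9) to (14) translate directly into code. The following sketch assumes an attribute is given as a column of categorical values aligned with the label vector; all function names are mine, not from the slides.

```python
import numpy as np

def info(labels):
    """Shannon entropy Info(D), equation (9)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_a(labels, attribute):
    """Expected information Info_A(D) after splitting on one attribute, equation (10)."""
    total = len(labels)
    return sum((np.sum(attribute == a) / total) * info(labels[attribute == a])
               for a in np.unique(attribute))

def gain(labels, attribute):
    """Information gain, equation (11)."""
    return info(labels) - info_a(labels, attribute)

def split_info(attribute):
    """Split information, equation (12); zero if the attribute has one value."""
    _, counts = np.unique(attribute, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, attribute):
    """Gain ratio, equation (13); unstable as split_info approaches zero."""
    return gain(labels, attribute) / split_info(attribute)

def gini(labels):
    """Gini index, equation (14)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Toy usage: an attribute that separates the two classes perfectly
labels = np.array([0, 0, 1, 1, 1])
attribute = np.array(["a", "a", "b", "b", "b"])
print(gain(labels, attribute), gain_ratio(labels, attribute), gini(labels))
```

A tree builder would evaluate one of these measures for every candidate attribute at a node, split on the best one, and recurse on the resulting partitions.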
Support Vector Machines

Hyperplane classifiers
- Vapnik et al. define a family of classifiers for binary classification problems: the class of hyperplanes in some dot product space H,

\[ \langle w, x \rangle + b = 0, \quad (17) \]

where w ∈ H, b ∈ ℝ.
- These correspond to decision functions ('classifiers'):

\[ f(x) = \operatorname{sgn}(\langle w, x \rangle + b) \quad (18) \]

- Vapnik et al. proposed a learning algorithm for determining this f from the training dataset.

The optimal hyperplane
- The optimal hyperplane maximises the margin of separation between any training point and the hyperplane:

\[ \max_{w \in H,\, b \in \mathbb{R}} \; \min \{\, \lVert x - x_i \rVert \;:\; x \in H,\ \langle w, x \rangle + b = 0,\ i = 1, \dots, m \,\} \quad (19) \]

Optimisation problem

\[ \operatorname*{minimise}_{w \in H,\, b \in \mathbb{R}} \; \tau(w) = \frac{1}{2} \lVert w \rVert^2 \quad (20) \]

subject to y_i(⟨w, x_i⟩ + b) ≥ 1 for all i = 1, ..., m.
- Why minimise (1/2)||w||^2? The size of the margin is 2/||w||; the smaller ||w||, the larger the margin.
- Why do we have to obey the constraints y_i(⟨w, x_i⟩ + b) ≥ 1? They ensure that all training points of the same class are on the same side of the hyperplane and outside the margin.

The Lagrangian
- We form the Lagrangian

\[ L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{m} \alpha_i \big( y_i (\langle x_i, w \rangle + b) - 1 \big) \quad (21) \]

- The Lagrangian is minimised with respect to the primal variables w and b, and maximised with respect to the dual variables α_i.

Support Vectors
- At optimality, ∂L/∂b = 0 and ∂L/∂w = 0, such that

\[ \sum_{i=1}^{m} \alpha_i y_i = 0 \quad (22) \qquad \text{and} \qquad w = \sum_{i=1}^{m} \alpha_i y_i x_i \quad (23) \]

- Hence the solution vector w, the crucial parameter of the SVM classifier, has an expansion in terms of the training points and their labels.
- The training points with α_i > 0 are the support vectors.

The dual problem
- Plugging (23) into the Lagrangian (21), we obtain the dual optimisation problem that is solved in practice:

\[ \operatorname*{maximise}_{\alpha \in \mathbb{R}^m} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad (24) \]

The kernel trick
- The key insight is that (24) accesses the training data only in terms of inner products ⟨x_i, x_j⟩.
- We can plug in an inner product of our choice here! This is referred to as a kernel k:

\[ k(x_i, x_j) = \langle x_i, x_j \rangle \quad (25) \]

Some prominent kernels
- Linear kernel:

\[ k(x_i, x_j) = \sum_{l=1}^{n} x_{il}\, x_{jl} = x_i^\top x_j \quad (26) \]

- Polynomial kernel:

\[ k(x_i, x_j) = (x_i^\top x_j + c)^d, \quad (27) \]

where c, d ∈ ℝ.
- Gaussian RBF kernel:

\[ k(x_i, x_j) = \exp\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right), \quad (28) \]

where σ ∈ ℝ.
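A sketch of kernels (26) to (28) and of the Gram matrix that the dual problem (24) consumes; the parameter defaults for c, d, and sigma are illustrative choices, not values from the slides.

```python
import numpy as np

def linear_kernel(xi, xj):
    """Linear kernel, equation (26)."""
    return xi @ xj

def polynomial_kernel(xi, xj, c=1.0, d=2):
    """Polynomial kernel, equation (27); c and d are illustrative defaults."""
    return (xi @ xj + c) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian RBF kernel, equation (28); sigma is an illustrative default."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Gram (kernel) matrix K with K[i, j] = k(x_i, x_j), as used in (24)."""
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

# Toy usage with made-up data
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(gram_matrix(X, rbf_kernel))
```

Such a precomputed Gram matrix can then be handed to an off-the-shelf dual solver, for example scikit-learn's SVC(kernel='precomputed'), rather than solving (24) by hand.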
References and further reading

[1] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[2] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann (Elsevier), 2006.

The end
See you tomorrow! Next topic: Clustering