CSC475 Music Information Retrieval
Data Mining III
George Tzanetakis, University of Victoria, 2014

Table of Contents
1 Linear Classifiers
2 Clustering
3 Regression

Decision Hyperplanes
Main idea: instead of trying to model all the training data using a probability density function, consider only the linear decision boundary and focus on finding an optimal one. This solves the classification problem directly (at least in the binary case) rather than transforming it into the potentially harder problem of fitting a distribution to a set of samples.
Hyperplane:
g(x) = w^T x + w_0 = 0,   (1)
where w = [w_1, w_2, ..., w_l]^T is the weight vector and w_0 is the threshold.

Geometry of the Decision Hyperplane (figure)

The Perceptron Algorithm
Problem: how can the unknown parameters w_i, i = 0, ..., l defining the decision hyperplane be determined from a training set?
Linear separability: there exists a w* such that
w*^T x > 0 for all x ∈ ω1,   (2)
w*^T x < 0 for all x ∈ ω2.   (3)

Perceptron Iterative Scheme
Cost function: J(w) = Σ_{x ∈ M} δ_x w^T x, where M is the subset of training vectors misclassified by the hyperplane defined by w, and δ_x is −1 if x ∈ ω1 and +1 if x ∈ ω2. Notice that J(w) is always nonnegative, and is zero exactly when no training examples are misclassified.
Iterative algorithm: start with random weights and iterate until convergence, with ρ_t a sequence of decreasing positive numbers:
w(t + 1) = w(t) − ρ_t Σ_{x ∈ M} δ_x x   (4)
More than one hyperplane can satisfy the separability condition, so the solution is not unique. The scheme is similar to gradient descent.

Neuron View (figure)

Characteristics of the Basic Perceptron
- Initial optimism was replaced by scepticism; the field of ANNs was effectively dormant for ten years in the 1970s.
- Well suited for online processing and easy to implement.
- Only works for linearly separable datasets.
- Prone to overfitting: no consideration of margin.
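As a concrete illustration (not part of the original slides), the perceptron iterative scheme above can be sketched in a few lines of NumPy. Labels are encoded as y = +1 for ω1 and y = −1 for ω2, so δ_x = −y; a fixed ρ is used instead of a decreasing sequence for simplicity, and the dataset is hypothetical.

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, max_iter=1000):
    """Batch perceptron: y is +1 for omega_1 and -1 for omega_2, so
    delta_x = -y.  rho is kept fixed here for simplicity (the slides
    use a decreasing sequence rho_t)."""
    # Augment inputs with a constant 1 so the threshold w_0 is learned too.
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xa.shape[1])
    for _ in range(max_iter):
        scores = Xa @ w
        misclassified = y * scores <= 0   # wrong side of the hyperplane
        if not misclassified.any():
            break                          # converged: training set separated
        # w(t+1) = w(t) - rho * sum over misclassified x of delta_x * x
        w = w - rho * np.sum(-y[misclassified, None] * Xa[misclassified], axis=0)
    return w

# Tiny linearly separable dataset: the class depends only on the first feature.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = perceptron_train(X, y)
Xa = np.hstack([np.ones((4, 1)), X])
print(np.sign(Xa @ w))  # → [-1. -1.  1.  1.]
```

Because the data is linearly separable, the loop is guaranteed to terminate with every training point on the correct side of the learned hyperplane.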
Multilayer Perceptron Networks
Also known as Artificial Neural Networks (ANNs).
- Resurgence in the 1980s when the backpropagation training algorithm became popular.
- Backpropagation requires a differentiable activation function (replacing the step function of the basic perceptron with, for example, a sigmoid).
- Allow arbitrary modeling of input to output, naturally supporting classification as well as regression and multilabel classification.
- Fell out of fashion around 2000 with the rise of support vector machines, but have recently resurfaced with deep learning.

Multilayer Perceptron Network (figure)

Characteristics of ANNs
- Slow convergence during training.
- Prone to overfitting due to the large number of parameters.
- Can solve complex problems in addition to classification.
- Unclear how to find the optimal architecture for the hidden layer(s).
- Easy to parallelize; the deep learning resurgence has been enabled, among other things, by GPU processing.

Support Vector Machines
When a problem is linearly separable there are several hyperplanes that achieve perfect separation. If we could choose among them, is there any reason to prefer one over another? The hyperplane that leaves more "room" on either side should be chosen, so that data from both classes can "move" more freely with less risk of causing an error. The formal term for this "room" is the margin, and the goal is to have the same distance to the nearest points of ω1 and ω2. So instead of simply searching for a hyperplane that perfectly separates the two classes, we add the requirement of a maximum margin. This can be cast as a non-linear (quadratic) optimization problem subject to a set of linear inequality constraints.
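A small numerical sketch (hypothetical data, not from the slides) of why the margin singles out one separating hyperplane: all three candidate hyperplanes below separate the two classes perfectly, but they leave different amounts of "room" to the nearest points.

```python
import numpy as np

# Two linearly separable classes (hypothetical points).
omega1 = np.array([[2.0, 0.0], [3.0, 1.0]])
omega2 = np.array([[-2.0, 0.0], [-3.0, -1.0]])
points = np.vstack([omega1, omega2])

def margin(w, w0):
    """Distance from the hyperplane w^T x + w0 = 0 to the nearest point."""
    return np.min(np.abs(points @ w + w0)) / np.linalg.norm(w)

# Three hyperplanes that all separate omega1 from omega2 perfectly...
candidates = {'x1 = 0':          (np.array([1.0, 0.0]), 0.0),
              'x1 = -1':         (np.array([1.0, 0.0]), 1.0),
              'x1 + 0.5 x2 = 0': (np.array([1.0, 0.5]), 0.0)}
# ...but they leave different amounts of room on either side.
for name, (w, w0) in candidates.items():
    print(name, margin(w, w0))
# The first hyperplane attains the largest margin (2.0); a maximum-margin
# classifier would prefer it over the other two.
```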
Maximum Margin Separating Hyperplane (figure)

Characteristics of SVMs
- Effective in high-dimensional spaces.
- Use only a subset of the training points (the support vectors).
- All operations involve inner products (the kernel trick).
- Do not directly provide probability estimates, but can be extended to do so.
- Require normalization of the training data (min/max).
- Binary classifiers: need to be extended for multi-class problems using one-vs-all or all-pairs.
- Prediction can be extremely fast.
- Generalization is backed by elegant theory.
- The optimization is hard to implement, but many good implementations exist.

Kernel Trick (figure)

Decision Trees
Definition: decision trees are models that predict the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables, with edges to children for each possible value of that variable. Each leaf represents a value of the target variable given the values of the input variables along the path from the root to the leaf. The classification process is more transparent, i.e., easier to interpret, than that of other algorithms.

The hoops dataset (Naive Bayes Gaussian) (figure)

Decision Tree Learning
Typically done by recursive partitioning of the training set.
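A minimal sketch of such recursive partitioning, using Gini impurity as the split-quality measure (the helper names and the toy dataset are hypothetical, and only categorical attributes are handled):

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_i f_i^2, where f_i are the label fractions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, labels, attr):
    """Size-weighted average Gini impurity of the subsets produced by
    splitting on a categorical attribute (lower is better)."""
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(row[attr], []).append(lab)
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

def build_tree(rows, labels):
    """Recursively partition the training set on the best attribute."""
    if len(set(labels)) == 1:        # pure subset: make a leaf
        return labels[0]
    if len(set(rows)) == 1:          # no split possible: majority label
        return Counter(labels).most_common(1)[0][0]
    best = min(range(len(rows[0])),
               key=lambda a: split_impurity(rows, labels, a))
    children = {}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep])
    return {'attr': best, 'children': children}

# Toy dataset: attribute 0 perfectly predicts the label, attribute 1 is noise.
rows = [('yes', 'hot'), ('yes', 'cold'), ('no', 'hot'), ('no', 'cold')]
labels = ['play', 'play', 'stay', 'stay']
tree = build_tree(rows, labels)
print(tree)
```

The greedy selection picks attribute 0 (its weighted impurity is 0), so the resulting tree is a single split with two pure leaves.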
- The input variable to split on is selected using some measure of split quality. Several such measures have been proposed (for example, Gini impurity and information gain). They are applied to each candidate subset, and the results are averaged to determine the overall quality of the split.
- Decision trees initially dealt only with categorical attributes, but have been extended to handle ordinal and continuous attributes (requiring a discretization step).

Split Quality Measures
Definition (Gini impurity):
I_G(f) = Σ_{i=1}^{m} f_i (1 − f_i) = Σ_{i=1}^{m} (f_i − f_i^2) = 1 − Σ_{i=1}^{m} f_i^2,   (5)
where f_i is the fraction of items labeled with value i in the data.
Definition (Information gain, based on the entropy):
I_E(f) = − Σ_{i=1}^{m} f_i log2 f_i   (6)

Decision Tree Example (figure)

Decision Tree with Hoops dataset (figure)

Characteristics of Decision Trees
- Simple to understand and interpret: a white box compared to black-box SVMs/ANNs (though white-box learning algorithms typically tend to perform worse than black-box ones).
- Can handle both categorical and numerical data.
- Robust and can handle large datasets.
- Parallel implementations are possible.
- Require little data preparation.
- Finding the optimal tree is NP-complete; heuristics such as greedy search do not guarantee an "optimum" tree.
- Prone to overfitting: require pruning.
- Some problems can result in very large trees.

Table of Contents
1 Linear Classifiers
2 Clustering
3 Regression

Clustering
Definition (clustering): the goal is to reveal the organization of a set of unlabeled data points by assigning them to clusters, thus effectively labeling them. This is also known as unsupervised learning. The detected groups should be "sensible", meaning that they should allow us to discover and observe similarities and differences among the patterns.
Types of clustering:
- Hierarchical (bottom-up and top-down)
- Data-based (the input is an nSamples by nAttributes feature matrix)
- Affinity-based (the input is an nSamples by nSamples similarity matrix)

Clustering Notation
Let X be the data set we are interested in clustering:
X = {x_1, x_2, ..., x_n}   (7)
Definition: an m-clustering of X is a partitioning of X into m sets (clusters) C_1, C_2, ..., C_m such that the following conditions hold:
C_i ≠ ∅, i = 1, ..., m   (8)
∪_{i=1}^{m} C_i = X   (9)
C_i ∩ C_j = ∅, i ≠ j, i, j = 1, ..., m   (10)

Evaluation of Clustering
Evaluating a clustering is more challenging than evaluating classification and requires much more subjective analysis. Many criteria have been proposed for this purpose. They can be grouped into internal criteria (no external information is required) and external criteria (partition information about the "correct" clustering is provided). Explaining the criteria in detail is beyond the scope of this tutorial. A simple example of an internal criterion would be preferring clusterings in which the average distance between points within a cluster is lower than the average distance between points in different clusters. A simple example of an external criterion would be the information gain (measuring the class purity of a cluster) when class labels are provided for the data points.

K-Means
Basic algorithm; the only input is K, the number of clusters:
1 Randomly assign each data point to one of K clusters
2 For each cluster compute the mean value of the data points in it
3 Re-assign each data point to the cluster with the closest mean
4 Repeat the last two steps until the means don't change
Similar in some ways to the EM algorithm, which can also be used for clustering. Applications in vector quantization.

Clustering Algorithm Comparison (figure)

Table of Contents
1 Linear Classifiers
2 Clustering
3 Regression
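Before moving on to regression, the four K-means steps above can be sketched directly in NumPy (a minimal illustration on hypothetical data; practical implementations handle empty clusters and initialization more carefully):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly assign each data point to one of K clusters.
    assign = rng.integers(0, K, size=len(X))
    means = np.zeros((K, X.shape[1]))
    for _ in range(max_iter):
        # 2. For each cluster compute the mean of the points assigned to it
        #    (an empty cluster simply keeps its previous mean).
        for k in range(K):
            members = X[assign == k]
            if len(members):
                means[k] = members.mean(axis=0)
        # 3. Re-assign each point to the cluster with the closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # 4. Repeat until the assignments (and hence the means) stop changing.
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign, means

# Two well-separated 1-D groups: {0, 1} and {10, 12}.
X = np.array([[0.0], [1.0], [10.0], [12.0]])
assign, means = kmeans(X, K=2)
print(assign, means.ravel())  # the two groups end up in different clusters
```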
Regression
Definition: regression is a predictive machine learning task in which the target value is continuous rather than discrete as in classification. Examples include stock market index prediction, sales prediction, and weather prediction. Regression can be univariate or multivariate. In linear regression the prediction function is constrained to be a weighted linear combination of the input attributes.
Error functions:
Absolute Error = Σ_i |y_i − f(x_i)|   (11)
Squared Error = Σ_i (y_i − f(x_i))^2   (12)

Linear Regression using Least Squares
Definition: the optimization problem of minimizing the residual can be solved analytically, resulting in a set of linear equations:
min_w ||Xw − y||_2^2   (13)
There are many other variants of regression that handle non-linearities and collinearities, are robust to noise, favor sparsity, and more.

Linear Regression Example (figure)
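As a quick illustration of equation (13), NumPy's least-squares solver recovers the weights of a noise-free line exactly (the data here is hypothetical):

```python
import numpy as np

# Fit y ≈ w1 * x + w0 by least squares, i.e. min_w ||Xw - y||^2 (eq. 13).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                          # noise-free line y = 2x + 1
X = np.column_stack([x, np.ones_like(x)])  # design matrix with a bias column
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 6))  # → [2. 1.]
```

With noisy targets the same call returns the weights minimizing the squared error of equation (12) rather than an exact fit.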