Download . - Villanova Computer Science

CS 8520: Artificial Intelligence Machine Learning 2 Paula Matuszek Spring, 2013 CSC 8520 Spring 2013. Paula Matuszek 1 Regression Classifiers • We said earlier that the task of a supervised learning system can be viewed as learning a function which predicts the outcome from the inputs: – Given a training set of N example pairs (x1, y1) (x2,y2)...(xn,yn), where each yj was generated by an unknown function y = f(x), discover a function h that approximates the true function y • A large class of supervised learning approaches discover h with regression methods. CSC 8520 Spring 2013. Paula Matuszek 2 Regression Classifiers • The basic idea underlying regression is: – plot all of the sample points for n-feature examples in an n-dimensional space – find an n-dimensional plane that separates the positive examples from the negative examples best • If you are familiar with the statistical concept of regression for prediction, this is the same idea. CSC 8520 Spring 2013. Paula Matuszek 3 Linear Classifiers denotes +1 denotes -1 How would you classify this data? Copyright © 2001, 2003, Andrew W. Moore Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Linear Classifiers denotes +1 denotes -1 How would you classify this data? Copyright © 2001, 2003, Andrew W. Moore Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Linear Classifiers denotes +1 denotes -1 How would you classify this data? Copyright © 2001, 2003, Andrew W. Moore Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Linear Classifiers denotes +1 denotes -1 How would you classify this data? Copyright © 2001, 2003, Andrew W. Moore Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Linear Classifiers denotes +1 denotes -1 Any of these would be fine.. ..but which is best? Copyright © 2001, 2003, Andrew W. Moore Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Measuring Fit • For prediction, we can measure the error of our prediction by looking at how far off our predicted value is from the actual value. – compute individual errors – sum them • Typically we are much more worried about large errors than small ones – square the errors • This gives us a measure of fit which is the sum of squared errors • This best fit hypothesis can be solved analytically (see equation 18.3 in the text) CSC 8520 Spring 2013. Paula Matuszek 9 Linear Classifiers • A linear classifier is just a hypothesis determined by linear regression with a threshold added • Rather than a hard threshold we typically use a logistic function to determine the optimal cutoffs. – fitted through gradient descent CSC 8520 Spring 2013. Paula Matuszek 10 More on Regression Models • So far we have discussed linear models • We can add dimensions to the model by including higher-order terms, such as squared or cubed values of the features. • As with decision trees, we can get overfitting. If we add enough dimensions we can fit almost anything! CSC 8520 Spring 2013. Paula Matuszek 11 Support Vector Machines • A Support Vector Machine (SVM) is a classifier – It uses features of instances to decide which class each instance belongs to • It is a supervised machine-learning classifier – Training cases are used to calculate parameters for a model which can then be applied to new instances to make a decision • It is a binary classifier – it distinguishes between two classes • It is currently the most popular off-the-shelf machine learning classifier CSC 8520 Spring 2013. Paula Matuszek 12 Basic Idea Underlying SVMs • Find a line, or a plane, or a hyperplane, that separates our classes cleanly. – This is the same concept as we have seen in regression. • By finding the greatest margin separating them – This is not the same concept as we have seen in regression. What does it mean? CSC 8520 Spring 2013. Paula Matuszek 13 Linear Classifiers denotes +1 denotes -1 Any of these would be fine.. ..but which is best? Copyright © 2001, 2003, Andrew W. Moore Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Classifier Margin denotes +1 denotes -1 Copyright © 2001, 2003, Andrew W. Moore Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint. Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Maximum Margin denotes +1 denotes -1 The maximum margin linear classifier is the linear classifier with the maximum margin. Called Linear Support Vector Machine (SVM) Copyright © 2001, 2003, Andrew W. Moore Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Maximum Margin denotes +1 denotes -1 Support Vectors are those datapoints that the margin pushes up against Copyright © 2001, 2003, Andrew W. Moore The maximum margin linear classifier is the linear classifier with the, um, maximum margin. Called Linear Support Vector Machine (SVM) Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Why Maximum Margin? denotes +1 denotes -1 Support Vectors are those datapoints that the margin pushes up against Copyright © 2001, 2003, Andrew W. Moore 1. If we’ve made a small error in the location off(x,w,b) the boundary (it’s been = sign(w. x - b) jolted in its perpendicular direction) this gives us least chance of causing a The maximum misclassification. margin linear 2. Empirically it works very very classifier iswell. the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Concept Check • For which of these could we use a basic linear SVM? – A: Classify the three kinds of iris in the UC Irvine data set? – B: Classify email into spam and non-spam? – C: Classify students into likely to pass or not? • Which of these is the SVM margin? A CSC 8520 Spring 2013. Paula Matuszek B 19 Messy Data • This is all good so far. • Suppose our aren’t that neat: CSC 8520 Spring 2013. Paula Matuszek 20 Soft Margins • Intuitively, it still looks like we can make a decent separation here. – Can’t make a clean margin – But can almost do so, if we allow some errors • We introduce a slack variable, which measure the degree of misclassification and adds a cost (C) for these misclassified instances. • Tradeoff between wide margin and classification errors – High cost will give relatively narrow margins – Low cost will give broader margins but misclassify more data. • How much we want it to cost to misclassify instances depends on our domain -- what we are trying to do CSC 8520 Spring 2013. Paula Matuszek 21 Only Two Errors, Narrow Margin CSC 8520 Spring 2013. Paula Matuszek 22 Several Errors, Wider Margin CSC 8520 Spring 2013. Paula Matuszek 23 Finding the Margin • Conceptually similar to sum of squared errors for regression • First we find the maximum margin separator – minimize the error of the points that are closest to the separator line – Formula 18.14 in the text • The margin is then the band that touches the nearest points CSC 8520 Spring 2013. Paula Matuszek 24 Evaluating SVMs • Same as evaluating any other classifier • Train on sample data, evaluate on test data (why?) • Look at: – classification accuracy – confusion matrix – sensitivity and specificity CSC 8520 Spring 2013. Paula Matuszek 25 More on Evaluating SVMs • Overfitting: very close fit to training data which takes advantage of irrelevant variations in instances – performance on test data will be much lower – may mean that your training sample isn’t representative – in SVMs, may mean that C is too high • Is the SVM actually useful? – Compare to the “majority” classifier CSC 8520 Spring 2013. Paula Matuszek 26 Non-Linearly-Separable Data • Suppose we can’t get a good linear separation of data? • As with regression, allowing non-linearity will give us better modeling of many data sets. • In SVMs, we do this by using a kernel. • A kernel is a function which maps our data into a higher-order order feature space where we can find a separating hyperplane • Common kernels are polynomial, RBF CSC 8520 Spring 2013. Paula Matuszek 27 Hard 1-dimensional dataset What can be done about this? x=0 Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Hard 1-dimensional dataset x=0 Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Hard 1-dimensional dataset x=0 Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. SVM Kernel Functions • K(a,b)=(a . b +1)d is an example of an SVM Kernel Function • Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function • Most common: Radial-Basis-style Kernel Function: Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Borrowed heavily from Andrew tutorials:: http://www.cs.cmu.edu/~awm/tutorials. Kernel Trick • We don’t actually have to compute the complete higher-order function • In the 18.14 equation we only use the dot product • So we replace it with a kernel function • This means we can work with much higher dimensions without getting hopeless performance • The kernel trick in SVMs refers to all of this: using a kernel function instead of the dot product to give us separation of non-linear data without impossible performance cost. CSC 8520 Spring 2013. Paula Matuszek 34 Why SVMs? • Focus on the instances nearest the margin is paying more attention to where the differences are critical • Can handle very large feature sets effectively • In practice has been shown to work well in a variety of domains CSC 8520 Spring 2013. Paula Matuszek 35 Summary • SVMs are a form of supervised classifier • The basic SVM is binary and linear, but there are non-linear and multi-class extensions • “One key insight and one neat trick”1 – key insight: maximum margin separator – neat trick: kernel trick • A good method to try first if you have no knowledge about the domain • Applicable in a wide variety of domains 1Artificial Intelligence, a Modern Approach, third edition, Russell and Norvig, 2010, p, 744 CSC 8520 Spring 2013. Paula Matuszek 36 Learning by Analogy: Case-based Reasoning • Case-based systems are a significant chunk of AI in their own right. A case-based system has two major components: – Case base – Problem solver • The case base contains a growing set of cases, analogous to either a KB or a training set. • Problem solver has – A case retriever and – A case reasoner. • May also have a case installer. CSC 8520 Spring 2010. Paula Matuszek CSC 8520 Spring 2013. Paula Matuszek 37 Case-Based Retrieval • Cases are described as a set of features • Retrieval uses methods such as – Nearest neighbor: compare all features to all cases in KB and choose closest match – Indexed: compute and store some indices with each case and retrieve matching indices – Domain-based model clustering: CB is organized into a domain model; insertion is harder, but retrieval is easier. • Example: “documents like this one” – Features are the word frequencies in the document CSC 8520 Spring 2010. Paula Matuszek CSC 8520 Spring 2013. Paula Matuszek 38 Case-Based Reasoning Example • A frequency matrix for diagnosing system problems is a simple case-based example • Representation is a matrix of observed symptoms and causes • Each case is an entry in cell of the matrix – Critic is actual outcome of case – Learner adds entry to appropriate cells • Performer matches symptoms, chooses possible causes CSC 8520 Spring 2013. Paula Matuszek 39 Car won’t start Battery dead Out of gas Alternator Battery bad bad case 1 case 2 case 3 Car stalls at stoplights case 4 case 5 Car misfires in rainy weather Lights won’t case 1 case 2 come on CSC 8520 Spring 2013. Paula Matuszek 40 Case-based Reasoning • Definition of relevant features is critical: – Need to get the ones which influence outcomes – At the right level of granularity • The reasoner can be a complex planning and what-if reasoning system, or a simple query for missing data. • Only really becomes a “learning” system if there is a case installer as well. • Can grow cumulatively. CSC 8520 Spring 2013. Paula Matuszek 41 Neural Nets, the very short version • A neural net consists of layers of nodes, or neurons, each of which has an activation level – Nodes of each layer receive inputs from previous layers; these are combined according to a set of weights. – If the activation level is reached the node “fires” and sends inputs to the next level – The initial layer is data from cases; the final layer is expected outcomes • Learning is accomplished by modifying the weights to reduce the prediction error CSC 8520 Spring 2013. Paula Matuszek 42 Figure 18.19: A neuron Figure 18:20 A simple network CSC 8520 Spring 2013. Paula Matuszek A network with an input layer, a hidden layer, and an output layer. 43 Neural Nets, continued • The typical method of modifying the weights is back-propagation – success or failure at the output node is “propagated back” through the nodes which contributed to that output node – essentially a form of gradient descent • Number of hidden nodes/layers is complicated decision – too many = overfitting – Typical is to try several and evaluate CSC 8520 Spring 2013. Paula Matuszek 44 Reinforcement Learning • If an agent has multiple sequential actions to perform, learning needs a different mode – each action affects available future actions – feedback may not be available after every action – agent has a long-term goal to maximize • Agent learns a policy which is basically a transition function • Issues include – exploration vs exploitation – credit assignment – generalization CSC 8520 Spring 2013. Paula Matuszek 45 Supervised Learning, Summary • Learning systems which, given examples and results, learn a model which will yield correct results for future examples • Typically used for classification • Most widely applied ML category • Depends on getting relevant features and representative examples • Evaluated against separate test sample; overfitting • Usefulness should be checked against “majority” classifier CSC 8520 Spring 2013. Paula Matuszek 46 Unsupervised Learning • Typically used to refer to clustering methods which don’t require training cases – No prior definition of goal – Typical aim is “put similar things together” • • • • Grouping search results Grouping inputs to a customer response system Purchases from a web site Census data about types and costs of housing • Combinations of hand-modeled and automatic can work very well: Google News, for instance. • Still requires good feature set CSC 8520 Spring 2013. Paula Matuszek 47 Clustering Basics    Collect examples Compute similarity among examples according to some metric Group examples together such that – examples within a cluster are similar – examples in different clusters are different   Summarize each cluster Sometimes -- assign new instances to the most similar cluster CSC 8520 Spring 2013. Paula Matuszek 48 Clustering Example . .. . . . . . . . . . . . . . Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt 3 Measures of Similarity • In order to do clustering we need some kind of measure of similarity. • This is basically our “critic” • Vector of values, depends on domain: – documents: bag of words, linguistic features – purchases: cost, purchaser data, item data – census data: most of what is collected • Cosine similarity CSC 8520 Spring 2013. Paula Matuszek 50 Cosine similarity measurement    Cosine similarity is a measure of similarity between two vectors by measuring the cosine of the angle between them. The result of the Cosine function is equal to 1 when the angle is 0, and it is less than 1 when the angle is of any other value. As the angle between the vectors shortens, the cosine angle approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases. CSC 8520 Spring 2013. Paula Matuszek Based on home.iitk.ac.in/~mfelixor/Files/non-numeric-Clustering-seminar.pp 51 Clustering Algorithms  Hierarchical – Bottom up – Top-down  Flat – K means  Probabilistic – Expectation Maximumization (E-M) CSC 8520 Spring 2013. Paula Matuszek 52 Hierarchical Agglomerative Clustering (HAC)   Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. The history of merging forms a binary tree or hierarchy. Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt CSC 8520 Spring 2013. Paula Matuszek 7 53 Dendogram: Hierarchical Clustering  Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster. CSC 8520 Spring 2013. Paula Matuszek http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt 54 Partitioning (Flat) Algorithms    Partitioning method: Construct a partition of n documents into a set of K clusters Given: a set of documents and the number K Find: a partition of K clusters that optimizes the chosen partitioning criterion – Globally optimal: exhaustively enumerate all partitions. Usually too expensive. – Effective heuristic methods: K-means algorithm. http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt CSC 8520 Spring 2013. Paula Matuszek 55 K-Means Clustering      Typically provide number of desired clusters, k. Randomly choose k instances as seeds. Form initial clusters based on these seeds. Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering. Stop when clustering converges or after a fixed number of iterations. Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt CSC 8520 Spring 2013. Paula Matuszek 18 56 K Means Example (K=2) Pick seeds Reassign clusters Compute centroids Reasssign clusters x x x x Compute centroids Reassign clusters Converged! Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt 22 K-Means • Tradeoff between having more clusters (better focus within each cluster) and having too many clusters. Overfitting again. • Results can vary based on random seed selection. – Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings. • The algorithm is sensitive to outliers – Data points that are far from other data points. – Could be errors in the data recording or some special data points with very different values. http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt CSC 8520 Spring 2013. Paula Matuszek 58 Strengths of k-means • Strengths: – Simple: easy to understand and to implement – Efficient: Time complexity: O(tkn), • where n is the number of data points, • k is the number of clusters, and • t is the number of iterations. – Since both k and t are small. k-means is considered a linear algorithm. • K-means is most popular clustering algorithm. • In practice, performs well, especially on text. www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt CSC 8520 Spring 2013. Paula Matuszek 59 REALLY Unsupervised Learning • Turn the machine loose to learn on its own • Needs – A representation. Still need some idea of what we are trying to learn! – Good natural language processing – A context • People don’t learn very well unsupervised. • Currently some interesting research for instancelevel knowledge. • Much harder to acquire structural or relational knowledge – but we are getting there. CSC 8520 Spring 2013. Paula Matuszek 60 More Aspects of Machine Learning • Machine learning varies by degree of human intervention: – Rote -- human builds KB. Cyc – Human assisted -- human adds knowledge directed by machine. Animals, Teiresias – Human scored -- human provides training cases. Neural nets, ID3, CART. – Completely automated. -- Nearest Neighbor, other clustering methods CSC 8520 Spring 2013. Paula Matuszek 61 More Aspects of Machine Learning • Machine Learning varies by degree of transparency – Hand-built KBs are by definition clear to humans – Human-aided trees like Animals are also generally clear and meaningful, could easily be modified by humans – Inferred rules like ID3's are generally understood by humans but may not be intuitively obvious. Modifying them by hand may lead to worse results. – Systems like SVMs are typically black box: you can look at the models but it's hard to interpret them in any humanmeaningful way and essentially impossible to modify them by hand. CSC 8520 Spring 2013. Paula Matuszek 62 More Aspects of Machine Learning • Machine learning varies by goal of the process – Extend a knowledge base – Improve some kind of decision making, such as guessing an animal or classifying diseases. – Improve overall performance of a program, such as game playing – Organize large amounts of data – Find patterns or "knowledge" not previously known, often to take some action. CSC 8520 Spring 2013. Paula Matuszek 63 The Web • Machine learning is one of those fields where the web is changing everything! • Three major factors – One problematic aspect of machine learning research is finding enough data. • This is NOT an issue on the web! – Another problematic aspect is getting a critic • Web offers a lot of opportunities – A third is identifying good practical uses for machine learning • Lots of online opportunities here CSC 8520 Spring 2013. Paula Matuszek 64 Online Uses for Machine Learning • Improved search: learn from click-throughs. • Recommendations: learn from peoples’ opinions and choices. • Online games. AIs add to the background but can’t be too static. • Better targeting for ads. More learning from clickthroughs. • Customer Response Centers. Clustering, improved retrieval of responses. CSC 8520 Spring 2013. Paula Matuszek 65 Summary • Valuable both because we want to understand how humans learn and because it improves computer systems • May learn representation or actions or both • Variety of methods, some knowledge-based and some statistical • Currently very active research area • Web is providing a lot of new opportunities • Still a long way to go CSC 8520 Spring 2013. Paula Matuszek 66

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download . - Villanova Computer Science