Artificial Intelligence Programming: Clustering
Chris Brooks
Department of Computer Science, University of San Francisco

Introduction

We've talked about learning previously in the context of specific algorithms. The purpose here is to discuss learning more generally:
- Give a flavor of other approaches to learning.
- Talk more carefully about how to evaluate the performance of a learning algorithm.

Defining Learning

So far, we've defined a learning agent as one that can improve its performance over time. We've seen two learning algorithms: decision trees and Bayesian learning. Let's define the problem a bit more precisely.

Defining Learning

A program is said to learn from experience E with respect to a set of tasks T and a performance measure P if its performance on T, as measured by P, improves with experience E. This means that, for a well-formulated learning problem, we need:
- A set of tasks the agent must perform
- A way to measure its performance
- A way to quantify the experience the agent receives

Examples

Speech recognition
- Task: successfully recognize spoken words
- Performance measure: fraction of words correctly recognized
- Experience: a database of labeled, spoken words

Learning to drive a car
- Task: drive on a public road using vision sensors
- Performance measure: average distance driven without error
- Experience: a sequence of images and reactions from a human driver

Learning to play backgammon
- Task: play backgammon
- Performance measure: number of games won against humans of the appropriate caliber
- Experience: playing games against itself

Discussion

Notice that not all performance measures are the same. In some cases, we want to minimize all errors.
In other cases, some sorts of errors can be more easily tolerated than others. Also, not all experience is the same:
- Are examples labeled?
- Does the learning agent immediately receive a reward after selecting an action?
- How is experiential data represented? Symbolic? Continuous?

Also: what is the final product? Do we simply need an agent that performs correctly, or is it important that we understand why the agent performs correctly?

Types of Learning Problems

One way to characterize learning problems is by the sorts of data and feedback our agent has access to:
- Batch vs. incremental
- Supervised vs. unsupervised
- Active vs. passive
- Online vs. offline

Classifiers

As we've seen, classification is a particularly common (and useful) learning problem: place unseen data into one of a set of classes. (An alternative learning problem is regression.) We can think of a classifier as a black box and just talk about how it performs.

Measuring Performance

How do we evaluate the performance of a classification algorithm? Two traditional measures are precision and recall.
- Precision is the fraction of examples classified as belonging to class x that really are of that class. How well does our hypothesis avoid false positives?
- Recall (sometimes called sensitivity) is the fraction of true members of class x that are actually captured by our hypothesis. How well does our hypothesis avoid false negatives?

Precision vs. Recall

Often, there is a tradeoff between precision and recall. In our playTennis example, what if we say we always play tennis? This will have high recall, but low precision. What if we say we'll never play tennis? High precision (we make no positive predictions, so no false positives), but low recall. Try to make the compromise that best suits your application.
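To make these two measures concrete, here is a minimal sketch of computing precision and recall for a binary classifier; the data and the "always play tennis" predictor are made up for illustration:

```python
def precision_recall(predicted, actual):
    """Precision and recall for binary labels (True = positive class)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)      # true positives
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)  # false positives
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 1.0  # no positive predictions: vacuously precise
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# "Always play tennis": predict positive for every example.
actual = [True, True, False, False, False]
always = [True] * 5
print(precision_recall(always, actual))  # -> (0.4, 1.0): perfect recall, poor precision
```

The degenerate "never play tennis" predictor shows the opposite extreme: it produces no false positives but captures none of the true positives.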
What is a case where a false positive would be worse than a false negative? What is a case where a false negative would be worse than a false positive?

Evaluation

Typically, in evaluating the performance of a learning algorithm, we're interested in questions such as:
- Does performance improve as the number of training examples increases?
- How do precision and recall trade off as the number of training examples changes?
- How does performance change as the problem gets easier or harder?

So what does "performance" mean?

Evaluation

Recall that supervised algorithms start with a set of labeled data. Divide this data into two disjoint subsets:
- Training set: used to train the classifier.
- Test set: used to evaluate the classifier's performance.

Procedure: train the classifier on the training set, then run each element of the test set through the classifier and count the number of incorrectly classified examples. If the classification is binary, you can also measure precision and recall.

Evaluation

How do we know we have a representative training and test set? Try it multiple times. N-fold cross-validation: do this N times:
- Select 1/N of the data at random as the test set.
- The remainder is the training set.
- Test as usual.
Average the results.

Ensemble Learning

Often, classifiers reach a point where improved performance on the training set leads to reduced performance on the test set. This is called overfitting. Representational bias can also impose upper limits on performance. One way to deal with this is through ensemble learning. Intuition: independently train several classifiers on the same data (different training subsets) and let them vote.
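The voting intuition can be sketched as follows; here each "classifier" is just a hypothetical function from an example to a label, standing in for an already-trained model:

```python
from collections import Counter

def ensemble_predict(classifiers, example):
    """Let each trained classifier vote; return the majority label."""
    votes = Counter(clf(example) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three toy threshold "classifiers" for illustration only.
clfs = [lambda x: '+' if x > 1 else '-',
        lambda x: '+' if x > 3 else '-',
        lambda x: '+' if x > 5 else '-']
print(ensemble_predict(clfs, 4))  # two of the three vote '+', so the ensemble says '+'
```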
This is basically what the Bayes optimal classifier does.

Bagging

This idea of training multiple classifiers is known as bagging (bootstrap aggregating):
- Start with our dataset D.
- Generate N subsets of D of equal size, sampled with replacement, so an item might appear in more than one subset.
- Train a classifier on each subset.
- Use majority voting to determine the classification.
Bagging helps prevent overfitting and can improve accuracy.

Boosting

A technique related to bagging is boosting. Idea: sequentially train classifiers to correct each other's errors.
- Pick your favorite classifier.
- For i = 1 to M:
  - Train the ith classifier on the weighted training set.
  - For each misclassified example, increase its "weight"; for each correctly classified example, decrease its "weight".

Boosting

To classify: present each test example to each classifier. Each classifier gets a vote, weighted by its precision. Very straightforward, and it can produce a substantial performance improvement. Combining stupid classifiers can be more effective than building one smart classifier.

Instance-Based Learning

So far, all of the learning algorithms we've studied construct an explicit hypothesis about the data set. This is nice because it lets us do a lot of the training ahead of time. It has the weakness that we must then use the same hypothesis for each element in the test set. One way to get around this is to construct a different hypothesis for each test example: potentially better results, but more computation needed at evaluation time. We can use this in either a supervised or unsupervised setting.

k-Nearest Neighbor

The most basic instance-based method is k-nearest neighbor.
Assume:
- Each individual can be represented as an N-dimensional vector: <v1, v2, ..., vn>.
- We have a distance metric that tells us how far apart two individuals are. Euclidean distance is common:

  d(x1, x2) = sqrt( sum_i (x1[i] - x2[i])^2 )

Supervised kNN

Training is trivial: store the training set. Assume each individual is an n-dimensional vector, plus a classification. Testing is more computationally complex: find the k closest points and collect their classifications, then use majority rule to classify the unseen point.

kNN Example

Suppose we have the following data points and are using 3-NN:

  X1  X2  Class
  4   3   +
  1   2   -
  2   2   +
  5   0   -

We see the following data point: x1 = 3, x2 = 1. How should we classify it?

kNN Example

Begin by computing distances:

  X1  X2  Class  Distance
  4   3   +      sqrt(5) = 2.23
  1   2   -      sqrt(5) = 2.23
  2   2   +      sqrt(2) = 1.41
  5   0   -      sqrt(5) = 2.23

The three closest points are 2, 3, and 4 (points 1, 2, and 4 tie at sqrt(5); we break the tie in favor of points 2 and 4). There are two '-' and one '+', so the new example is classified as negative.

Discussion

kNN can be a very effective algorithm when you have lots of data:
- Easy to compute
- Resistant to noise
Bias: points that are "close" to each other share a classification.

Discussion

Issues:
- How do we choose the best k? Search using cross-validation.
- Distance is computed globally. Recall the data we used for decision tree training: part of the goal was to eliminate irrelevant attributes.
- All neighbors get an equal vote.

Distance-Weighted Voting

One extension is to weight a neighbor's vote by its distance to the example to be classified. Each "vote" is weighted by the inverse square of the distance.
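The supervised kNN procedure, applied to the worked example above, can be sketched as follows (the function name is ours; the data is from the example):

```python
from math import dist          # Euclidean distance, Python 3.8+
from collections import Counter

def knn_classify(training, query, k):
    """training: list of (point, label) pairs; return majority label of the k nearest."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

data = [((4, 3), '+'), ((1, 2), '-'), ((2, 2), '+'), ((5, 0), '-')]
# Distances to (3, 1): sqrt(5), sqrt(5), sqrt(2), sqrt(5).
# Points 1, 2, and 4 tie at sqrt(5), so the 3-NN answer depends on how the tie
# is broken; the example above picks points 2 and 4 and classifies as '-'.
print(knn_classify(data, (3, 1), k=1))  # the unambiguous nearest neighbor is '+'
```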
Once we add this, we can actually drop the k and just use all instances to classify new data.

Attribute Weighting

A more serious problem for kNN is the presence of irrelevant attributes. In many data sets, there are a large number of attributes that are completely unrelated to classification: more attributes can actually lower classification performance. This is sometimes called the curse of dimensionality.

Attribute Weighting

We can address this problem by assigning a weight to each component of the distance calculation:

  d(p1, p2) = sqrt( sum_i w[i] * (p1[i] - p2[i])^2 )

where w is a vector of weights. This has the effect of transforming, or stretching, the instance space: more useful features get larger weights.

Learning Attribute Weights

We can learn attribute weights through a hill-climbing search:

  let w = random weights
  let val(w) be the error rate for w under n-fold cross-validation
  while not done:
      for i in range(len(w)):
          w'[i] = w[i] + delta
          if val(w') < val(w):
              keep the new weights

We could also use a genetic algorithm or simulated annealing to do this.

Unsupervised Learning

What if we want to group instances, but we don't know their classes? We just want "similar" instances to be in the same group. Examples:
- Clustering documents based on text
- Grouping users with similar preferences
- Identifying demographic groups

K-means Clustering

Let's suppose we want to group our items into K clusters. For the moment, assume K is given. Approach 1:
- Choose K items at random. We will call these the centers; each center gets its own cluster.
- For each other item, assign it to the cluster that minimizes the distance between it and the center.
This is called K-means clustering.

K-means Clustering

To evaluate a clustering, we measure the sum of all distances between instances and the center of their cluster. But how do we know that we picked good centers? We don't. We need to adjust them.

Tuning the Centers

For each cluster, find its mean: the point c that minimizes the total distance to all points in the cluster. But what if some points are now in the wrong cluster?

Iterate

Check all points to see if they are in the correct cluster. If not, reassign them. Then recompute the centers. Continue until no points change clusters.

K-means Pseudocode

  centers = random items
  while not done:
      for each item: assign it to the closest center
      for each center: find the mean of its cluster

Hierarchical Clustering

K-means produces a flat set of clusters: each document is in exactly one cluster. What if we want a tree of clusters? Topics and subtopics, and relationships between clusters. We can do this using hierarchical clustering.

Hierarchical Clustering

One application is document processing: given a collection of documents, organize them into clusters based on topic, with no preset list of potential categories and no labeled documents. Algorithm:

  D = {d1, d2, ..., dn}
  while |D| > k:
      find the documents di and dj that are closest according to some similarity measure
      remove them from D
      construct a new d' that is the "union" of di and dj, and add it to D

Recommender Systems

One application of these sorts of approaches is recommender systems (Netflix, Amazon). Goal: suggest items to users that they're likely to be interested in. Real goal: for a given user, find other users she is similar to.

Basic Approach

A user is modeled as a vector of items she has rated. For every other user, compute the distance to that user. (We might also use K-means here ahead of time.) Find the closest user(s), and suggest items that similar users liked.

Advantages

- Computation is simple and scalable.
- No need to model the items themselves: we don't need an ontology, or even any idea of what the items are.
- Performs better as more data is added.

Algorithmic Challenges

- The curse of dimensionality.
- Not all items are independent; we might want to learn weights for items, or combine items into larger groups.
- This approach tends to recommend popular items, since they're likely to have been rated by lots of people.

Practical Challenges

- How do we get users to rate items?
- How do we get users to rate truthfully?
- What about new and unrated items?
- What if a user is not similar to anyone?

Summary

Instance-based learning is a very effective approach to dealing with large numeric data sets. kNN can be used in supervised settings. In unsupervised settings, k-means is a simple and effective choice. Most recommender systems use a form of this approach.
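As a closing sketch, the K-means pseudocode from earlier can be fleshed out in Python. One-dimensional data and a fixed choice of initial centers (the first K items rather than a random K, for repeatability) keep the sketch short; everything else follows the assign-then-recompute loop described above:

```python
def kmeans(items, k, iterations=20):
    """Cluster 1-D points into k clusters; return (centers, clusters)."""
    centers = items[:k]  # the slides pick random items; we take the first k for repeatability
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in items:  # assign each item to its closest center
            closest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[closest].append(x)
        new_centers = [sum(c) / len(c) if c else centers[i]  # mean of each cluster
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # no center moved: we're done
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([1.0, 1.5, 2.0, 10.0, 11.0, 12.0], k=2)
print(centers)  # one center settles near the low group, one near the high group
```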