Artificial Intelligence Programming
Introduction
Chris Brooks
Department of Computer Science, University of San Francisco

Defining Learning
- We've talked about learning previously in the context of specific algorithms. So far, we've defined a learning agent as one that can improve its performance over time.
- Purpose of this lecture:
  - Discuss learning more generally.
  - We've seen two learning algorithms: decision trees and Bayesian learning. Give a flavor of other approaches to learning.
  - Talk more carefully about how to evaluate the performance of a learning algorithm.
  - Clustering.
- Let's define the problem a bit more precisely.

Defining Learning
- A program is said to learn from experience E with respect to a set of tasks T and a performance measure P if its performance on T, as measured by P, improves with experience E.
- This means that, for a well-formulated learning problem, we need:
  - A set of tasks the agent must perform
  - A way to measure its performance
  - A way to quantify the experience the agent receives

Examples
- Speech recognition
  - Task: successfully recognize spoken words
  - Performance measure: fraction of words correctly recognized
  - Experience: a database of labeled, spoken words
- Learning to drive a car
  - Task: drive on a public road using vision sensors
  - Performance measure: average distance driven without error
  - Experience: a sequence of images and reactions recorded from a human driver
- Learning to play backgammon
  - Task: play backgammon
  - Performance measure: number of games won against humans of the appropriate caliber
  - Experience: playing games against itself

Discussion
- Notice that not all performance measures are the same. In some cases, we want to minimize all errors; in other cases, some sorts of errors can be more easily tolerated than others.
- Not all experience is the same either. Are examples labeled? Does a learning agent immediately receive a reward after selecting an action? How is experiential data represented: symbolic or continuous?
- Also: what is the final product? Do we simply need an agent that performs correctly, or is it important that we understand why the agent performs correctly?

Types of learning problems
- One way to characterize learning problems is by the sorts of data and feedback our agent has access to:
  - batch vs. incremental
  - supervised vs. unsupervised
  - active vs. passive
  - online vs. offline

Classifiers
- As we've seen, classification is a particularly common (and useful) learning problem: place unseen data into one of a set of classes.
- An alternative learning problem is regression.
- We can think of a classifier as a black box and just talk about how it performs.

Measuring Performance
- How do we evaluate the performance of a classifying learning algorithm? Two traditional measures are precision and recall.
- Precision is the fraction of examples classified as belonging to class x that really are of that class. How well does our hypothesis avoid false positives?
- Recall is the fraction of true members of class x that are captured by our hypothesis. How well does our hypothesis avoid false negatives? (A sketch of computing both follows.)
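As a concrete illustration (not from the original slides), here is a minimal Python sketch of how precision and recall might be computed for one class of a binary classifier; the function name and the '+'/'-' label encoding are illustrative choices.

    def precision_recall(predicted, actual, positive='+'):
        # Count true positives, false positives, and false negatives
        # for the positive class.
        tp = sum(1 for p, a in zip(predicted, actual) if p == positive and a == positive)
        fp = sum(1 for p, a in zip(predicted, actual) if p == positive and a != positive)
        fn = sum(1 for p, a in zip(predicted, actual) if p != positive and a == positive)
        precision = tp / (tp + fp) if tp + fp else 0.0   # how well we avoid false positives
        recall = tp / (tp + fn) if tp + fn else 0.0      # how well we avoid false negatives
        return precision, recall

    # A classifier that always predicts '+' has perfect recall but poor precision.
    print(precision_recall(['+', '+', '+', '+'], ['+', '-', '+', '-']))   # (0.5, 1.0)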
Precision vs. recall
- Often, there is a tradeoff between precision and recall.
- In our playTennis example, what if we say we always play tennis? This will have high recall but low precision.
- What if we say we'll never play tennis? High precision, low recall.
- Try to make the compromise that best suits your application.
- What is a case where a false positive would be worse than a false negative? What is a case where a false negative would be worse than a false positive?

Evaluation
- Recall that supervised algorithms start with a set of labeled data.
- Divide this data into two disjoint subsets:
  - Training set: used to train the classifier.
  - Test set: used to evaluate the classifier's performance.
- Procedure:
  - Train the classifier on the training set.
  - Run each element of the test set through the classifier.
  - Count the number of incorrectly classified examples.
- If the classification is binary, you can also measure precision and recall.

Evaluation
- Typically, in evaluating the performance of a learning algorithm, we'll be interested in the following sorts of questions:
  - Does performance improve as the number of training examples increases?
  - How do precision and recall trade off as the number of training examples changes?
  - How does performance change as the problem gets easier or harder?
- So what does "performance" mean?

Evaluation
- How do we know we have a representative training and test set? Try it multiple times.
- N-fold cross-validation (sketched in code below). Do this N times:
  - Select 1/N of the data at random as the test set.
  - The remainder is the training set.
  - Test as usual.
  - Average the results.

Ensemble learning
- Often, classifiers reach a point where improved performance on the training set leads to reduced performance on the test set. This is called overfitting.
- Representational bias can also lead to upper limits on performance.
- One way to deal with this is through ensemble learning.
- Intuition: independently train several classifiers on the same data (different training subsets) and let them vote. This is basically what the Bayes optimal classifier does.

Bagging
- This idea of training multiple classifiers is known as bagging (also sketched below).
- Start with our dataset D.
- Generate N subsets of D of equal size; a data point might appear in more than one subset.
- Train a classifier on each subset.
- Use majority voting to determine the classification.
- Prevents overfitting, and can improve accuracy.

Boosting
- A technique related to bagging is boosting.
- Idea: sequentially train classifiers to correct each other's errors.
- Pick your favorite classifier. For i = 1 to M:
  - Train the ith classifier on the training set.
  - For each misclassified example, increase its "weight"; for each correctly classified example, decrease its "weight".

Boosting
- To classify: present each test example to each classifier. Each classifier gets a vote, weighted by its precision.
- Very straightforward, and it can produce a substantial performance improvement.
- Combining stupid classifiers can be more effective than building one smart classifier.
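Going back to the evaluation procedure above, here is a minimal Python sketch of N-fold cross-validation, using the common variant in which the folds partition the data; `train` and `test_accuracy` are assumed, caller-supplied functions.

    import random

    def n_fold_cross_validation(data, n, train, test_accuracy):
        # train(training_set) -> classifier
        # test_accuracy(classifier, test_set) -> fraction classified correctly
        data = list(data)
        random.shuffle(data)
        folds = [data[i::n] for i in range(n)]           # n roughly equal folds
        scores = []
        for i in range(n):
            test_set = folds[i]                          # 1/N of the data
            training_set = [ex for j in range(n) if j != i for ex in folds[j]]
            classifier = train(training_set)
            scores.append(test_accuracy(classifier, test_set))
        return sum(scores) / n                           # average the results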
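And, in the same style, a sketch of the bagging procedure; it assumes `train` returns a callable classifier, and it samples each subset with replacement, so an example can land in more than one subset.

    import random
    from collections import Counter

    def bagging_train(data, n_classifiers, train):
        # Generate N subsets of D of size |D| (sampling with replacement)
        # and train one classifier on each subset.
        return [train([random.choice(data) for _ in range(len(data))])
                for _ in range(n_classifiers)]

    def bagging_classify(classifiers, example):
        # Majority vote across the ensemble determines the classification.
        votes = Counter(c(example) for c in classifiers)
        return votes.most_common(1)[0][0]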
Instance-Based Learning
- So far, all of the learning algorithms we've studied construct an explicit hypothesis about the data set.
- This is nice because it lets us do a lot of the training ahead of time. It has the weakness that we must then use the same hypothesis for each element in the test set.
- One way to get around this is to construct a different hypothesis for each test example: potentially better results, but more computation needed at evaluation time.
- We can use this in either a supervised or an unsupervised setting.

k-nearest neighbor
- The most basic instance-based method is k-nearest neighbor (kNN).
- Assume:
  - Each individual can be represented as an N-dimensional vector <v1, v2, ..., vn>, plus a classification.
  - We have a distance metric that tells us how far apart two individuals are. Euclidean distance is common: d(x1, x2) = sqrt( Σ_i (x1[i] − x2[i])² )

Supervised kNN
- Training is trivial: store the training set.
- Testing is more computationally complex: find the k closest points, collect their classifications, and use majority rule to classify the unseen point.

kNN Example
- Suppose we have the following data points and are using 3-NN:

  Point  X1  X2  Class
    1     4   3    +
    2     1   2    −
    3     2   2    +
    4     5   0    −

- We see the following data point: x1 = 3, x2 = 1. How should we classify it?

kNN Example
- Begin by computing distances to (3, 1):

  Point  X1  X2  Class  Distance
    1     4   3    +    √5 ≈ 2.23
    2     1   2    −    √5 ≈ 2.23
    3     2   2    +    √2 ≈ 1.41
    4     5   0    −    √5 ≈ 2.23

- The three closest points are 2, 3, and 4. (Points 1, 2, and 4 are tied at √5, so the choice among them comes down to tie-breaking.) There are two '−' and one '+', so the new example is classified as negative.

Discussion
- kNN can be a very effective algorithm when you have lots of data:
  - Easy to compute.
  - Resistant to noise.
- Bias: points that are "close" to each other share a classification.
- Issues:
  - How do we choose the best k? Search for it using cross-validation.
  - Distance is computed globally.
  - All neighbors get an equal vote.

Discussion
- Recall the data we used for decision tree training: part of the goal there was to eliminate irrelevant attributes.

Distance-weighted voting
- One extension is to weight a neighbor's vote by its distance to the example to be classified: each "vote" is weighted by the inverse square of the distance.
- Once we add this, we can actually drop the k and just use all instances to classify new data.

Attribute Weighting
- A more serious problem for kNN is the presence of irrelevant attributes: in many data sets, there are a large number of attributes that are completely unrelated to classification.
- More data can then actually lower classification performance. This is sometimes called the curse of dimensionality.

Attribute Weighting
- We can address this problem by assigning a weight to each component of the distance calculation: d(p1, p2) = sqrt( Σ_i w[i] (p1[i] − p2[i])² ), where w is a vector of weights.
- This has the effect of transforming, or stretching, the instance space: more useful features get larger weights. (A weighted-kNN sketch follows.)
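A minimal Python sketch of supervised kNN using the weighted distance above (pass unit weights for plain Euclidean distance); the function names are illustrative.

    import math
    from collections import Counter

    def weighted_distance(x1, x2, w):
        # d(x1, x2) = sqrt( sum_i w[i] * (x1[i] - x2[i])^2 )
        return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x1, x2)))

    def knn_classify(training, query, k, w=None):
        # training: list of (vector, class) pairs; query: an unseen vector.
        w = w or [1.0] * len(query)                      # default: plain Euclidean
        nearest = sorted(training,
                         key=lambda ex: weighted_distance(ex[0], query, w))[:k]
        votes = Counter(label for _, label in nearest)   # majority rule
        return votes.most_common(1)[0][0]

    data = [((4, 3), '+'), ((1, 2), '-'), ((2, 2), '+'), ((5, 0), '-')]
    # The example above: three training points tie at distance sqrt(5),
    # so the 3-NN answer depends on how the tie is broken.
    print(knn_classify(data, (3, 1), k=3))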
Learning Attribute Weights
- We can learn attribute weights through a hillclimbing search (the original pseudocode compared error rates; stated in terms of accuracy it is):

  let w = random weights
  let val(w) = classification accuracy using weights w, under n-fold cross-validation
  while not done:
      for i in range(len(w)):
          let w2 = w, with w2[i] = w[i] + delta
          if val(w2) > val(w):
              w = w2

- We could also use a GA or simulated annealing to do this.

Unsupervised Learning
- What if we want to group instances, but we don't know their classes? We just want "similar" instances to end up in the same group.
- Examples:
  - Clustering documents based on text
  - Grouping users with similar preferences
  - Identifying demographic groups

K-means Clustering
- Let's suppose we want to group our items into K clusters. (For the moment, assume K is given.)
- Approach 1: Choose K items at random; we'll call these the centers, and each center gets its own cluster. Assign every other item to the cluster whose center is closest to it.
- This is called K-means clustering.

K-means Clustering
- To evaluate a clustering, we measure the sum of all distances between instances and the center of their cluster.
- But how do we know that we picked good centers? We don't. We need to adjust them.

Tuning the centers
- For each cluster, find its mean: the point c that minimizes the total distance to all points in the cluster. Make that the new center.
- But what if some points are now in the wrong cluster? Check all points to see if they are in the correct cluster; if not, reassign them.
- Iterate.

K-means pseudocode

  centers = K random items
  while not done:
      foreach item: assign it to the closest center
      foreach center: recompute it as the mean of its cluster

- Continue until no points change clusters. (A runnable sketch appears below.)

Hierarchical Clustering
- K-means produces a flat set of clusters: each document is in exactly one cluster.
- What if we want a tree of clusters? Topics and subtopics; relationships between clusters.
- We can do this using hierarchical clustering.

Hierarchical Clustering
- One application is in document processing: given a collection of documents D = {d1, d2, ..., dn}, organize them into clusters based on topic, with no preset list of potential categories and no labeled documents.
- Algorithm (also sketched below):

  While |D| > k:
      Find the documents di and dj that are closest according to some similarity measure.
      Remove them from D.
      Construct a new d′ that is the "union" of di and dj, and add it to D.

- Result: a tree of categories emerges from the collection of documents.

Recommender Systems
- One application of these sorts of approaches is in recommender systems: Netflix, Amazon.
- A user is modeled as a vector of the items she has rated.
- Goal: suggest items to users that they're likely to be interested in. Real goal: for a given user, find other users she is similar to.

Basic Approach
- Find the closest user(s), and suggest items that similar users liked.
- Algorithm: for every other user, compute the distance to that user. (We might also use K-means ahead of time to group users.) A sketch appears below.
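A minimal Python version of the K-means pseudocode above, for items represented as tuples of numbers; the names are illustrative.

    import math
    import random

    def k_means(points, k, max_iters=100):
        centers = random.sample(points, k)               # K random items as centers
        clusters = [[] for _ in range(k)]
        for _ in range(max_iters):
            # Assign each item to the closest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[i].append(p)
            # Recompute each center as the mean of its cluster
            # (an empty cluster keeps its old center).
            new_centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
                           for i, c in enumerate(clusters)]
            if new_centers == centers:                   # no points changed clusters
                break
            centers = new_centers
        return centers, clusters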
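A sketch of the agglomerative hierarchical-clustering algorithm above. It assumes the caller supplies a `closeness` function that can compare clusters as well as single documents (for instance, by comparing centroids or merged word counts); clusters are represented as nested pairs, so the result is a forest of category trees.

    def hierarchical_cluster(docs, k, closeness):
        # closeness(a, b): larger = more similar; must accept merged pairs too.
        D = list(docs)
        while len(D) > k:
            # Find the closest pair d_i, d_j (a simple O(n^2) scan).
            i, j = max(((a, b) for a in range(len(D)) for b in range(a + 1, len(D))),
                       key=lambda ab: closeness(D[ab[0]], D[ab[1]]))
            merged = (D[i], D[j])                        # the "union" of d_i and d_j
            D = [d for idx, d in enumerate(D) if idx not in (i, j)]
            D.append(merged)
        return D                                         # k trees of categories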
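And a sketch of the basic recommender approach; the Euclidean distance over commonly rated items and the rating threshold for "liked" are illustrative choices, not part of the original slides.

    import math
    from collections import Counter

    def recommend(target, ratings, k=5, liked=4):
        # ratings: {user: {item: rating}}; target: the user to recommend for.
        def distance(u, v):
            common = set(ratings[u]) & set(ratings[v])
            if not common:
                return float('inf')                      # no overlap: maximally distant
            return math.sqrt(sum((ratings[u][i] - ratings[v][i]) ** 2 for i in common))
        # Find the k closest other users...
        neighbors = sorted((u for u in ratings if u != target),
                           key=lambda u: distance(target, u))[:k]
        # ...and suggest items they liked that the target hasn't rated yet.
        seen = set(ratings[target])
        votes = Counter(item for u in neighbors for item, r in ratings[u].items()
                        if item not in seen and r >= liked)
        return [item for item, _ in votes.most_common()]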
Advantages
- Computation is simple and scalable.
- No need to model the items themselves: we don't need an ontology, or even any idea of what the items are.
- Performs better as more data is added.

Algorithmic Challenges
- The curse of dimensionality.
- Not all items are independent: we might want to learn weights for items, or combine items into larger groups.
- This approach tends to recommend popular items, since they're likely to have been rated by lots of people.
- What about new and unrated items? What if a user is not similar to anyone?

Practical Challenges
- How do we get users to rate items?
- How do we get users to rate truthfully?

Summary
- Instance-based learning is a very effective approach to dealing with large numeric data sets.
- kNN can be used in supervised settings.
- In unsupervised settings, K-means is a simple and effective choice.
- Most recommender systems use a form of this approach.