Artificial Intelligence
Programming
Clustering
Chris Brooks
Department of Computer Science
University of San Francisco
Introduction
We’ve talked about learning previously in the context of
specific algorithms.
Purpose: discuss learning more generally.
Give a flavor of other approaches to learning
Talk more carefully about how to evaluate the
performance of a learning algorithm.
Defining Learning
So far, we’ve defined a learning agent as one that can
improve its performance over time.
We’ve seen two learning algorithms:
Decision tree
Bayesian Learning
Let’s define the problem a bit more precisely.
Defining Learning
A program is said to learn from experience E with
respect to a set of tasks T and a performance measure
P if its performance on T, as measured by P, improves
with experience E.
This means that, for a well-formulated learning problem,
we need:
A set of tasks the agent must perform
A way to measure its performance
A way to quantify the experience the agent receives
Examples
Speech recognition
Task: successfully recognize spoken words
Performance measure: fraction of words correctly recognized
Experience: A database of labeled, spoken words
Learning to drive a car
Task: Drive on a public road using vision sensors
Performance: average distance driven without error
Experience: sequence of images and reactions from a human driver.
Learning to play backgammon
Task: play backgammon
Performance measure: number of games won against humans of the appropriate
caliber.
Experience: Playing games against itself.
Discussion
Notice that not all performance measures are the same.
In some cases, we want to minimize all errors. In
other cases, some sorts of errors can be more easily
tolerated than others.
Also, not all experience is the same.
Are examples labeled?
Does a learning agent immediately receive a reward
after selecting an action?
How is experiential data represented? Symbolic?
Continuous?
Also: What is the final product?
Do we simply need an agent that performs correctly?
Or is it important that we understand why the agent
performs correctly?
Types of learning problems
One way to characterize learning problems is by the
sorts of data and feedback our agent has access to.
Batch vs. incremental
Supervised vs. unsupervised
Active vs. passive
Online vs. offline
Classifiers
As we’ve seen, classification is a particularly common
(and useful) learning problem.
Place unseen data into one of a set of classes.
An alternative learning problem is regression: predicting a continuous value rather than a discrete class.
We can think of a classifier as a black box and just talk
about how it performs.
Measuring Performance
How do we evaluate the performance of a classifying
learning algorithm?
Two traditional measures are precision and recall.
Precision is the fraction of examples classified as
belonging to class x that really belong to that class.
How well does our hypothesis avoid false positives?
Recall is the fraction of true members of class x that are
actually captured by our hypothesis.
How well does our hypothesis avoid false negatives?
Precision vs recall
Often, there is a tradeoff of precision vs recall.
In our playTennis example, what if we say we always
play tennis?
This will have high recall, but low precision.
What if we say we’ll never play tennis?
High precision, low recall.
Try to make a compromise that best suits your
application.
What is a case where a false positive would be worse
than a false negative?
What is a case where a false negative would be better
than a false positive?
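Going back to the playTennis example, here is a minimal Python sketch that computes precision and recall for one class from parallel lists of predicted and true labels (the label names and function name are illustrative, not from the slides):

def precision_recall(predicted, actual, positive="play"):
    """Precision and recall for the 'positive' class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == positive and a == positive)
    fp = sum(1 for p, a in zip(predicted, actual) if p == positive and a != positive)
    fn = sum(1 for p, a in zip(predicted, actual) if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A classifier that always says "play" catches every true positive
# (recall = 1.0) but also produces false positives (low precision).
actual = ["play", "no", "play", "no", "no"]
always_play = ["play"] * len(actual)
print(precision_recall(always_play, actual))   # (0.4, 1.0)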
Evaluation
Typically, in evaluating the performance of a learning
algorithm, we’ll be interested in the following sorts of
questions:
Does performance improve as the number of training
examples increases?
How do precision and recall trade off as the number
of training examples changes?
How does performance change as the problem gets
easier/harder?
So what does ’performance’ mean?
Evaluation
Recall that supervised algorithms start with a set of
labeled data.
Divide this data into two subsets:
Training set: used to train the classifier.
Test set: used to evaluate the classifier’s
performance.
These sets are disjoint.
Procedure:
Train the classifier on the training set.
Run each element of the test set through the
classifier. Count the number of incorrectly classified
examples.
If the classification is binary, you can also measure
precision and recall.
Evaluation
How do we know we have a representative training and
test set?
Try it multiple times.
N-fold cross-validation:
Split the data into N folds of equal size.
Each fold in turn serves as the test set; the remaining
folds are the training set.
Test as usual.
Average the results over the N runs.
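A rough Python sketch of this procedure, assuming a generic make_classifier() factory whose classifiers expose train and classify methods (a hypothetical interface, not from the slides) and at least n_folds examples:

import random

def cross_validate(data, labels, make_classifier, n_folds=10):
    """Estimate the average error rate by n-fold cross-validation.
    data and labels are parallel lists; make_classifier() returns a fresh,
    untrained classifier with train(xs, ys) and classify(x) methods."""
    indices = list(range(len(data)))
    random.shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]   # N disjoint folds
    error_rates = []
    for fold in folds:
        held_out = set(fold)
        train_x = [data[i] for i in indices if i not in held_out]
        train_y = [labels[i] for i in indices if i not in held_out]
        clf = make_classifier()
        clf.train(train_x, train_y)
        wrong = sum(1 for i in fold if clf.classify(data[i]) != labels[i])
        error_rates.append(wrong / len(fold))
    return sum(error_rates) / n_folds   # average over the N runs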
Ensemble learning
Often, classifiers reach a point where improved
performance on the training set leads to reduced
performance on the test set.
This is called overfitting
Representational bias can also lead to upper limits in
performance.
One way to deal with this is through ensemble learning.
Intuition: Independently train several classifiers on
the same data (different training subsets) and let
them vote.
This is basically what the Bayes optimal classifier
does.
Bagging
This idea of training multiple classifiers on random
subsets of the data is known as bagging (bootstrap aggregating).
Start with our dataset D. Generate N subsets of D of
equal size by sampling from D with replacement.
An example might appear in more than one subset (or more than once within a subset).
Train a classifier on each subset.
Use majority voting to determine classification.
Prevents overfitting, can improve accuracy.
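A minimal sketch of bagging in Python, again assuming a hypothetical make_classifier() factory whose classifiers expose train and classify:

import random
from collections import Counter

def bagging_train(data, labels, make_classifier, n_classifiers=10):
    """Train n_classifiers, each on a bootstrap sample of the data."""
    ensemble = []
    n = len(data)
    for _ in range(n_classifiers):
        # Sample n indices with replacement: some examples repeat, some are left out.
        sample = [random.randrange(n) for _ in range(n)]
        clf = make_classifier()
        clf.train([data[i] for i in sample], [labels[i] for i in sample])
        ensemble.append(clf)
    return ensemble

def bagging_classify(ensemble, x):
    """Classify x by majority vote over the ensemble."""
    votes = Counter(clf.classify(x) for clf in ensemble)
    return votes.most_common(1)[0][0]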
Boosting
A related technique to bagging is boosting
Idea: Sequentially train classifiers to correct each
other’s errors.
Pick your favorite classifier.
For i = 1 to M :
Train the ith classifier on the training set.
For each misclassified example, increase its “weight”.
For each correctly classified example, decrease its “weight”.
Boosting
To classify:
Present each test example to each classifier.
Each classifier gets a vote, weighted by its precision.
Very straightforward - can produce substantial
performance improvement.
Combining stupid classifiers can be more effective
than building one smart classifier.
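A simplified Python sketch in the spirit of these two slides; the doubling and halving of example weights and the train(data, labels, weights) interface are illustrative assumptions (AdaBoost is the standard, more careful version of this idea):

from collections import defaultdict

def boost_train(data, labels, make_classifier, n_rounds=5):
    """Reweight the examples after each round and remember each classifier
    along with its weighted accuracy on the training set."""
    n = len(data)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        clf = make_classifier()
        clf.train(data, labels, weights)     # assumed to accept per-example weights
        correct = [clf.classify(x) == y for x, y in zip(data, labels)]
        accuracy = sum(w for w, c in zip(weights, correct) if c)
        ensemble.append((clf, accuracy))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * (2.0 if not c else 0.5) for w, c in zip(weights, correct)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boost_classify(ensemble, x):
    """Each classifier votes, weighted by how well it did during training."""
    votes = defaultdict(float)
    for clf, acc in ensemble:
        votes[clf.classify(x)] += acc
    return max(votes, key=votes.get)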
Instance-Based Learning
So far, all of the learning algorithms we’ve studied
construct an explicit hypothesis about the data set.
This is nice because it lets us do a lot of the training
ahead of time.
It has the weakness that we must then use the same
hypothesis for each element in the test set.
One way to get around this is to construct different
hypotheses for each test example.
Potentially better results, but more computation
needed at evaluation time.
We can use this in either a supervised or unsupervised
setting.
k-nearest neighbor
The most basic instance-based method is k-nearest
neighbor.
Assume:
Each individual can be represented as an n-dimensional vector $\langle v_1, v_2, \ldots, v_n \rangle$.
We have a distance metric that tells us how far apart
two individuals are.
Euclidean distance is common:
$d(x_1, x_2) = \sqrt{\sum_i (x_1[i] - x_2[i])^2}$
Supervised kNN
Training is trivial.
Store the training set. Assume each individual is an
n-dimensional vector, plus a classification.
Testing is more computationally complex:
Find the k closest points and collect their
classifications.
Use majority rule to classify the unseen point.
kNN Example
Suppose we have the following data points and are using 3-NN:

X1   X2   Class
4    3    +
1    1    -
2    2    +
5    1    -

We see the following data point: x1 = 3, x2 = 1. How
should we classify it?
kNN Example
Begin by computing distances:
X1   X2   Class   Distance
4    3    +       √5 ≈ 2.23
1    1    -       2
2    2    +       √2 ≈ 1.41
5    1    -       2
The three closest points are 2,3,4. There are 2 ‘-’, and 1
‘+’.
Therefore the new example is negative.
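A minimal Python version of this example (the data points and query match the table above; the helper names are just for illustration):

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, query, k=3):
    """training is a list of (point, label) pairs; classify query by
    majority vote among the k nearest training points."""
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((4, 3), '+'), ((1, 1), '-'), ((2, 2), '+'), ((5, 1), '-')]
print(knn_classify(training, (3, 1), k=3))   # '-'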
Discussion
K-NN can be a very effective algorithm when you have
lots of data.
Easy to compute
Resistant to noise.
Bias: points that are “close” to each other share
classification.
Discussion
Issues:
How to choose the best k?
Search using cross-validation
Distance is computed globally.
Recall the data we used for decision tree training.
Part of the goal was to eliminate irrelevant attributes.
All neighbors get an equal vote.
Distance-weighted voting
One extension is to weight a neighbor’s vote by its
distance to the example to be classified.
Each ’vote’ is weighted by the inverse square of the
distance.
Once we add this, we can actually drop the ’k’, and just
use all instances to classify new data.
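A short sketch of distance-weighted voting, using the same (point, label) representation as the k-NN sketch above:

import math
from collections import defaultdict

def weighted_classify(training, query):
    """Every training point votes for its class, weighted by 1/distance^2."""
    votes = defaultdict(float)
    for point, label in training:
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(point, query)))
        if d == 0:
            return label                 # query coincides with a training point
        votes[label] += 1.0 / (d ** 2)
    return max(votes, key=votes.get)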
Attribute Weighting
A more serious problem with kNN is the presence of
irrelevant attributes.
In many data sets, there are a large number of attributes
that are completely unrelated to classification.
Including such attributes actually lowers classification performance.
This is sometimes called the curse of dimensionality.
Attribute Weighting
We can address this problem by assigning a weight to
each component of the distance calculation.
$d(p_1, p_2) = \sqrt{\sum_i (w[i](p_1[i] - p_2[i]))^2}$, where $w$ is a vector of weights.
This has the effect of transforming or stretching the
instance space.
More useful features have larger weights
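A one-function sketch of this weighted distance; the weight values below are arbitrary, just to show the stretching effect:

import math

def weighted_distance(p1, p2, w):
    """Euclidean distance with each dimension scaled by its weight."""
    return math.sqrt(sum((wi * (a - b)) ** 2 for wi, a, b in zip(w, p1, p2)))

# Weight 0 on the second attribute makes it irrelevant to the distance;
# a large weight would stretch the space along that dimension instead.
print(weighted_distance((3, 1), (4, 3), (1.0, 0.0)))   # 1.0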
Learning Attribute Weights
We can learn attribute weights through a hillclimbing
search.
let w = random weights
let val(w) be the error rate for w under n-fold cross-validation
while not done:
    for i in range(len(w)):
        w[i] = w[i] + delta
        if val(w) did not decrease:
            undo the change (w[i] = w[i] - delta)
We could also use a GA or simulated annealing to do
this.
Unsupervised Learning
What if we want to group instances, but we don’t know
their classes?
We just want “similar” instances to be in the same
group.
Examples:
Clustering documents based on text
Grouping users with similar preferences
Identifying demographic groups
K-means Clustering
Let’s suppose we want to group our items into K
clusters.
For the moment, assume K given.
Approach 1:
Choose K items at random. We will call these the
centers.
Each center gets its own cluster.
For each other item, assign it to the cluster that
minimizes distance between it and the center.
This is called K-means clustering.
K-means Clustering
To evaluate this, we measure the sum of squared distances
between instances and the center of their cluster.
But how do we know that we picked good centers?
We don’t. We need to adjust them.
Tuning the centers
For each cluster, find its mean.
This is the point c that minimizes the total squared distance to
all points in the cluster.
But what if some points are now in the wrong cluster?
Iterate
Check all points to see if they are in the correct cluster.
If not, reassign them.
Then recompute centers.
Continue until no points change clusters.
K-means pseudocode
centers = random items
while assignments change:
    for each item:
        assign it to the closest center
    for each center:
        move it to the mean of its cluster
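A rough, runnable version of this pseudocode (Lloyd's algorithm) in Python; items are tuples of numbers, and k is assumed to be at most the number of items:

import math
import random

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(items, k, max_iters=100):
    """Alternate between assigning each item to its closest center and
    moving each center to the mean of its cluster."""
    centers = random.sample(items, k)
    assignment = [None] * len(items)
    for _ in range(max_iters):
        new_assignment = [min(range(k), key=lambda c: math.dist(item, centers[c]))
                          for item in items]
        if new_assignment == assignment:      # no item changed cluster: done
            break
        assignment = new_assignment
        for c in range(k):
            members = [items[i] for i in range(len(items)) if assignment[i] == c]
            if members:                       # keep the old center if a cluster empties
                centers[c] = mean(members)
    return centers, assignment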
Hierarchical Clustering
K-means produces a flat set of clusters.
Each document is in exactly one cluster.
What if we want a tree of clusters?
Topics and subtopics.
Relationships between clusters.
We can do this using hierarchical clustering
Hierarchical Clustering
One application is in document processing.
Given a collection of documents, organize them into
clusters based on topic.
No preset list of potential categories, or labeled
documents.
Algorithm:
D = {d1, d2, ..., dn}
While |D| > k:
    Find the two documents di and dj in D that are closest
    according to some similarity measure.
    Remove di and dj from D.
    Construct a new d′ that is the “union” of di and dj and
    add it to D.
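A small Python sketch of this agglomerative procedure, using Euclidean distance between cluster centroids as the closeness measure (one simple choice; the slides leave the similarity measure open). A full implementation would also record each merge, so that the result is a tree of clusters rather than a flat set:

import math

def centroid(cluster):
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def agglomerate(points, k):
    """Repeatedly merge the two closest clusters until only k remain.
    Each cluster is the list of original points it contains."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]     # the "union" of the two closest
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters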
Recommender Systems
One application of these sorts of approaches is in
recommender systems
Netflix, Amazon
Goal: Suggest items to users that they’re likely to be
interested in.
Real goal: For a given user, find other users she is
similar to.
Basic Approach
A user is modeled as a vector of items she has rated.
For every other user, compute the distance to that user.
(We might also use K-means here ahead of time)
Find the closest user(s), and suggest items that similar
users liked.
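A minimal sketch of this nearest-user approach in Python, where each user is a dict mapping item ids to ratings (the data layout and the min_rating threshold are illustrative assumptions):

import math

def distance(u, v):
    """Euclidean distance over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return float("inf")            # no overlap: treat as maximally distant
    return math.sqrt(sum((u[i] - v[i]) ** 2 for i in shared))

def recommend(target, all_users, min_rating=4):
    """Find the most similar other user and suggest the items they liked
    that the target user has not rated yet."""
    others = [u for u in all_users if u is not target]
    nearest = min(others, key=lambda u: distance(target, u))
    return [item for item, rating in nearest.items()
            if rating >= min_rating and item not in target]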
Advantages
Computation is simple and scalable
No need to model the items themselves
Don’t need an ontology, or even any idea of what
items are.
Performs better as more data is added.
Algorithmic Challenges
Curse of dimensionality
Not all items are independent
We might want to learn weights for items, or combine
items into larger groups.
This approach tends to recommend popular items.
They’re likely to have been rated by lots of people.
Practical Challenges
How to get users to rate items?
How to get users to rate truthfully?
What about new and unrated items?
What if a user is not similar to anyone?
Summary
Instance-based learning is a very effective approach to
dealing with large numeric data sets.
k-NN can be used in supervised settings.
In unsupervised settings, k-means is a simple and
effective choice.
Most recommender systems use a form of this
approach.