Artificial Intelligence
Programming
Introduction
We've talked about learning previously in the context of specific algorithms.
So far, we've defined a learning agent as one that can improve its performance over time.
We've seen two learning algorithms:
  Decision trees
  Bayesian learning
Purpose of this lecture:
  Discuss learning more generally.
  Give a flavor of other approaches to learning: clustering.
  Talk more carefully about how to evaluate the performance of a learning algorithm.
Let's define the problem a bit more precisely.
Chris Brooks
Department of Computer Science
University of San Francisco
Defining Learning
A program is said to learn from experience E with
respect to a set of tasks T and a performance measure
P if its performance on T, as measured by P, improves
with experience E.
This means that, for a well-formulated learning problem,
we need:
A set of tasks the agent must perform
A way to measure its performance
A way to quantify the experience the agent receives
Examples
Speech recognition
Task: successfully recognize spoken words
Performance measure: fraction of words correctly recognized
Experience: A database of labeled, spoken words
Learning to drive a car
Task: Drive on a public road using vision sensors
Performance: average distance driven without error
Experience: sequence of images and reactions from a human driver.
Learning to play backgammon
Task: play backgammon
Performance measure: number of games won against humans of the appropriate
caliber.
Experience: Playing games against itself.
Discussion
Notice that not all performance measures are the same.
In some cases, we want to minimize all errors. In
other cases, some sorts of errors can be more easily
tolerated than others.
Also, not all experience is the same.
Are examples labeled?
Does a learning agent immediately receive a reward
after selecting an action?
How is experiential data represented? Symbolic?
Continuous?
Also: What is the final product?
Do we simply need an agent that performs correctly?
Or is it important that we understand why the agent
performs correctly?
Types of learning problems
One way to characterize learning problems is by the sorts of data and feedback our agent has access to:
  Batch vs. incremental
  Supervised vs. unsupervised
  Active vs. passive
  Online vs. offline
Classifiers
As we've seen, classification is a particularly common (and useful) learning problem.
  Place unseen data into one of a set of classes.
An alternative learning problem is regression.
Measuring Performance
How do we evaluate the performance of a classifying learning algorithm?
We can think of a classifier as a black box and just talk about how it performs.
Two traditional measures are precision and recall.
Precision is the fraction of examples classified as belonging to class x that are really of that class.
  How well does our hypothesis avoid false positives?
Recall is the fraction of true members of class x that are actually captured by our hypothesis.
  How well does our hypothesis avoid false negatives?
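The two measures can be made concrete with a short Python sketch; the "+"/"-" labels and function names are illustrative, not from the slides:

```python
# Precision and recall for a binary classifier, computed from
# predicted and true labels.

def precision(predicted, actual, positive="+"):
    """Fraction of examples we labeled positive that really are positive.
    Measures how well we avoid false positives."""
    true_pos = sum(1 for p, a in zip(predicted, actual)
                   if p == positive and a == positive)
    total_pred_pos = sum(1 for p in predicted if p == positive)
    return true_pos / total_pred_pos if total_pred_pos else 0.0

def recall(predicted, actual, positive="+"):
    """Fraction of truly positive examples we captured.
    Measures how well we avoid false negatives."""
    true_pos = sum(1 for p, a in zip(predicted, actual)
                   if p == positive and a == positive)
    total_actual_pos = sum(1 for a in actual if a == positive)
    return true_pos / total_actual_pos if total_actual_pos else 0.0

# A classifier that says "always positive" has perfect recall on the
# positive class but poor precision.
actual    = ["+", "+", "-", "-", "-"]
predicted = ["+", "+", "+", "+", "+"]
print(precision(predicted, actual))  # 0.4
print(recall(predicted, actual))     # 1.0
```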
Precision vs recall
Often, there is a tradeoff between precision and recall.
In our playTennis example, what if we say we always play tennis?
  This will have high recall, but low precision.
What if we say we'll never play tennis?
  High precision, low recall.
Try to make a compromise that best suits your application.
What is a case where a false positive would be worse than a false negative?
What is a case where a false negative would be worse than a false positive?
Evaluation
Typically, in evaluating the performance of a learning algorithm, we'll be interested in the following sorts of questions:
  Does performance improve as the number of training examples increases?
  How do precision and recall trade off as the number of training examples changes?
  How does performance change as the problem gets easier/harder?
So what does 'performance' mean?
Evaluation
Recall that supervised algorithms start with a set of labeled data.
Divide this data into two subsets:
  Training set: used to train the classifier.
  Test set: used to evaluate the classifier's performance.
These sets are disjoint.
Procedure:
  Train the classifier on the training set.
  Run each element of the test set through the classifier. Count the number of incorrectly classified examples.
If the classification is binary, you can also measure precision and recall.
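The train/test procedure above can be sketched in Python. Here `train` and `classify` are placeholders for any supervised learner, and the toy majority-label learner is purely illustrative:

```python
import random

# Split labeled data into disjoint training and test sets, train a
# classifier, then report the error rate on the held-out test set.

def evaluate(data, train, classify, test_fraction=0.25, seed=0):
    """data: list of (features, label) pairs. Returns the test error rate."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test_set, training_set = shuffled[:n_test], shuffled[n_test:]
    model = train(training_set)
    errors = sum(1 for x, label in test_set if classify(model, x) != label)
    return errors / len(test_set)

# Toy learner: always predict the majority label seen in training.
def train_majority(training_set):
    labels = [label for _, label in training_set]
    return max(set(labels), key=labels.count)

def classify_majority(model, x):
    return model

data = [((i,), "+") for i in range(8)] + [((i,), "-") for i in range(2)]
print(evaluate(data, train_majority, classify_majority))
```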
Evaluation
How do we know we have a representative training and test set?
  Try it multiple times.
N-fold cross-validation:
  Do this N times:
    Select 1/N of the data at random as the test set.
    The remainder is the training set.
    Test as usual.
  Average the results.
Often, classifiers reach a point where improved performance on the training set leads to reduced performance on the test set.
  This is called overfitting.
Representational bias can also lead to upper limits in performance.
Ensemble learning
One way to deal with this is through ensemble learning.
  Intuition: Independently train several classifiers on the same data (different training subsets) and let them vote.
  This is basically what the Bayes optimal classifier does.
Bagging
This idea of training multiple classifiers is known as bagging.
Start with our dataset D. Generate N subsets of D of equal size.
  Data might be in more than one subset.
Train a classifier on each subset.
Use majority voting to determine classification.
Prevents overfitting, and can improve accuracy.
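A minimal sketch of bagging, assuming a toy 1-nearest-neighbor stand-in for the base classifier (any learner would do):

```python
import random
from collections import Counter

# Bagging: train several copies of a base classifier on random subsets
# of the data (an example may appear in more than one subset), then
# classify by majority vote.

def make_subsets(data, n_subsets, subset_size, seed=0):
    rng = random.Random(seed)
    # Sample with replacement, so subsets can overlap.
    return [[rng.choice(data) for _ in range(subset_size)]
            for _ in range(n_subsets)]

def train_1nn(subset):
    return subset  # "training" 1-NN just stores the subset

def classify_1nn(model, x):
    nearest = min(model, key=lambda ex: abs(ex[0] - x))
    return nearest[1]

def bagged_classify(models, x):
    votes = Counter(classify_1nn(m, x) for m in models)
    return votes.most_common(1)[0][0]  # majority vote

data = [(0, "-"), (1, "-"), (2, "-"), (8, "+"), (9, "+"), (10, "+")]
models = [train_1nn(s) for s in make_subsets(data, n_subsets=5, subset_size=4)]
print(bagged_classify(models, 9.5))
```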
Boosting
A related technique to bagging is boosting.
Idea: Sequentially train classifiers to correct each other's errors.
Pick your favorite classifier.
For i = 1 to M:
  Train the ith classifier on the (weighted) training set.
  For each misclassified example, increase its "weight".
  For each correctly classified example, decrease its "weight".
To classify:
  Present each test example to each classifier.
  Each classifier gets a vote, weighted by its precision.
Very straightforward, and can produce substantial performance improvement.
  Combining stupid classifiers can be more effective than building one smart classifier.
Instance-Based Learning
So far, all of the learning algorithms we've studied construct an explicit hypothesis about the data set.
This is nice because it lets us do a lot of the training ahead of time.
It has the weakness that we must then use the same hypothesis for each element in the test set.
One way to get around this is to construct different hypotheses for each test example.
  Potentially better results, but more computation needed at evaluation time.
We can use this in either a supervised or unsupervised setting.
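The boosting loop from the Boosting slide can be sketched as follows. This is a minimal sketch, not full AdaBoost: the weak learner is a hypothetical one-dimensional decision stump, and each classifier's vote is weighted by its weighted training accuracy rather than its precision, for simplicity:

```python
# Sequentially train M stumps, raising the weight of misclassified
# examples and lowering the weight of correct ones, then take a
# weighted vote.

def train_stump(data, weights):
    """Pick the threshold and sign minimizing weighted error on 1-D data."""
    best = None
    for threshold, _ in data:
        for above in ("+", "-"):
            below = "-" if above == "+" else "+"
            err = sum(w for (x, label), w in zip(data, weights)
                      if (above if x >= threshold else below) != label)
            if best is None or err < best[0]:
                best = (err, threshold, above)
    return best  # (weighted error, threshold, label assigned above threshold)

def stump_classify(stump, x):
    _, threshold, above = stump
    below = "-" if above == "+" else "+"
    return above if x >= threshold else below

def boost(data, M=3):
    weights = [1.0 / len(data)] * len(data)
    ensemble = []  # list of (vote weight, stump)
    for _ in range(M):
        stump = train_stump(data, weights)
        vote = 1.0 - stump[0] / sum(weights)  # weighted accuracy as vote weight
        ensemble.append((vote, stump))
        for i, (x, label) in enumerate(data):
            if stump_classify(stump, x) != label:
                weights[i] *= 2.0   # misclassified: increase its weight
            else:
                weights[i] *= 0.5   # correct: decrease its weight
    return ensemble

def boosted_classify(ensemble, x):
    tally = {"+": 0.0, "-": 0.0}
    for vote, stump in ensemble:
        tally[stump_classify(stump, x)] += vote
    return max(tally, key=tally.get)

data = [(0, "-"), (1, "-"), (2, "-"), (5, "+"), (6, "+"), (7, "+")]
ensemble = boost(data)
print(boosted_classify(ensemble, 6))  # "+"
```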
k-nearest neighbor
The most basic instance-based method is k-nearest neighbor.
Assume:
  Each individual can be represented as an n-dimensional vector: <v1, v2, ..., vn>.
  We have a distance metric that tells us how far apart two individuals are.
Euclidean distance is common:
  d(x1, x2) = sqrt( sum_i (x1[i] - x2[i])^2 )
Supervised kNN
Training is trivial:
  Store the training set. Assume each individual is an n-dimensional vector, plus a classification.
Testing is more computationally complex:
  Find the k closest points and collect their classifications.
  Use majority rule to classify the unseen point.
kNN Example
Suppose we have the following data points and are using 3-NN:

  X1  X2  Class
   4   3    +
   1   1    -
   2   2    +
   5   1    -

We see the following data point: x1 = 3, x2 = 1. How should we classify it?
kNN Example
Begin by computing distances:

  X1  X2  Class  Distance
   4   3    +    sqrt(5) = 2.24
   1   1    -    2.00
   2   2    +    sqrt(2) = 1.41
   5   1    -    2.00

The three closest points are 2, 3, and 4. There are 2 '-', and 1 '+'.
Therefore the new example is negative.
Discussion
kNN can be a very effective algorithm when you have lots of data.
  Easy to compute.
  Resistant to noise.
Issues:
  Bias: points that are "close" to each other share a classification.
  How to choose the best k? Search using cross-validation.
  Distance is computed globally.
    Recall the data we used for decision tree training: part of the goal was to eliminate irrelevant attributes.
  All neighbors get an equal vote.
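The 3-NN example above, worked in Python:

```python
import math
from collections import Counter

# Classify the query point (3, 1) against the stored examples using
# Euclidean distance and a majority vote over the 3 nearest neighbors.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, query, k=3):
    # Sort stored examples by distance to the query, then majority-vote
    # over the classes of the k closest.
    by_distance = sorted(training, key=lambda ex: euclidean(ex[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((4, 3), "+"), ((1, 1), "-"), ((2, 2), "+"), ((5, 1), "-")]
print(knn_classify(training, (3, 1)))  # "-" (neighbors at distances 1.41, 2.00, 2.00)
```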
Distance-weighted voting
One extension is to weight a neighbor's vote by its distance to the example to be classified.
  Each 'vote' is weighted by the inverse square of the distance.
Once we add this, we can actually drop the 'k', and just use all instances to classify new data.
Attribute Weighting
A more serious problem with kNN is the presence of irrelevant attributes.
  In many data sets, there are a large number of attributes that are completely unrelated to classification.
  More attributes can actually lower classification performance. This is sometimes called the curse of dimensionality.
We can address this problem by assigning a weight to each component of the distance calculation:
  d(p1, p2) = sqrt( sum_i w[i] * (p1[i] - p2[i])^2 ), where w is a vector of weights.
This has the effect of transforming or stretching the instance space.
  More useful features have larger weights.
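The weighted distance above, as a short Python function; the sample points and weights are illustrative:

```python
import math

def weighted_distance(p1, p2, w):
    # Each attribute i contributes w[i] * (p1[i] - p2[i])^2.
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, p1, p2)))

# Zeroing an irrelevant attribute's weight removes it from the distance:
print(weighted_distance((1, 100), (1, -100), (1.0, 0.0)))  # 0.0
print(weighted_distance((1, 0), (4, 4), (1.0, 1.0)))       # 5.0
```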
Learning Attribute Weights
We can learn attribute weights through a hillclimbing search:

  let w = random weights
  let val(w) be the error rate for w under n-fold cross-validation
  while not done:
      for i in range(len(w)):
          w'[i] = w[i] + delta
          if val(w') < val(w):
              w = w'   # keep the new weights

We could also use a GA or simulated annealing to do this.
Unsupervised Learning
What if we want to group instances, but we don't know their classes?
We just want "similar" instances to be in the same group.
Examples:
  Clustering documents based on text
  Grouping users with similar preferences
  Identifying demographic groups
K-means Clustering
Let's suppose we want to group our items into K clusters.
  For the moment, assume K is given.
Approach 1:
  Choose K items at random. We will call these the centers.
  Each center gets its own cluster.
  For each other item, assign it to the cluster that minimizes the distance between it and the center.
This is called K-means clustering.
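A runnable version of the hillclimbing pseudocode above. The objective `val` here is a stand-in with a known optimum; in the real algorithm it would be the cross-validated error rate of the weighted classifier:

```python
def val(w):
    # Stand-in objective with minimum at w = (2, 0); lower is better,
    # like an error rate.
    return (w[0] - 2.0) ** 2 + (w[1] - 0.0) ** 2

def hillclimb(w, delta=0.1, max_steps=200):
    w = list(w)
    for _ in range(max_steps):
        improved = False
        for i in range(len(w)):
            for step in (delta, -delta):     # nudge each weight both ways
                candidate = list(w)
                candidate[i] += step
                if val(candidate) < val(w):  # lower error: keep new weights
                    w = candidate
                    improved = True
        if not improved:
            break                            # local optimum reached
    return w

w = hillclimb([0.5, 0.5])
print(w, val(w))
```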
K-means Clustering
To evaluate this, we measure the sum of all distances between instances and the center of their cluster.
But how do we know that we picked good centers?
  We don't. We need to adjust them.
Tuning the centers
For each cluster, find its mean.
  This is the point c that minimizes the total distance to all points in the cluster.
Then recompute the centers.
But what if some points are now in the wrong cluster?
  Check all points to see if they are in the correct cluster.
  If not, reassign them.
Iterate: continue until no points change clusters.
K-means pseudocode

  centers = random items
  while not done:
      foreach item:
          assign to closest center
      foreach center:
          find mean of its cluster

Hierarchical Clustering
K-means produces a flat set of clusters.
  Each document is in exactly one cluster.
What if we want a tree of clusters?
  Topics and subtopics.
  Relationships between clusters.
We can do this using hierarchical clustering.
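The K-means pseudocode above, made runnable; the 2-D sample points and the empty-cluster guard are additions:

```python
import math
import random

# Pick K random items as centers, assign every item to its closest
# center, recompute each center as its cluster's mean, and repeat
# until no point changes cluster.

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(items, k, seed=0):
    centers = random.Random(seed).sample(items, k)
    assignment = None
    while True:
        # Assign each item to the closest center.
        new_assignment = [min(range(k), key=lambda j: distance(item, centers[j]))
                          for item in items]
        if new_assignment == assignment:
            return centers, assignment  # no point changed cluster: done
        assignment = new_assignment
        # Recompute each center as the mean of its cluster.
        for j in range(k):
            cluster = [it for it, a in zip(items, assignment) if a == j]
            if cluster:  # keep the old center if its cluster went empty
                centers[j] = mean(cluster)

items = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, assignment = kmeans(items, k=2)
print(sorted(centers))
```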
Hierarchical Clustering
One application is in document processing.
  Given a collection of documents, organize them into clusters based on topic.
  No preset list of potential categories, or labeled documents.
Algorithm:
  D = {d1, d2, ..., dn}
  While |D| > k:
    Find the documents di and dj that are closest according to some similarity measure.
    Remove them from D.
    Construct a new d' that is the "union" of di and dj and add it to D.
Result: a tree of categories emerges from a collection of documents.
Recommender Systems
One application of these sorts of approaches is in recommender systems.
  Netflix, Amazon
Goal: Suggest items to users that they're likely to be interested in.
Real goal: For a given user, find other users she is similar to.
Basic Approach
A user is modeled as a vector of items she has rated.
For every other user, compute the distance to that user.
  (We might also use K-means here ahead of time.)
Find the closest user(s), and suggest items that similar users liked.
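A sketch of the agglomerative algorithm from the Hierarchical Clustering slide. Here 1-D points stand in for documents, the distance between cluster means is the (dis)similarity measure, and the new d' is a nested tuple recording the merge tree; all of these representation choices are illustrative:

```python
# Repeatedly merge the two closest clusters until only k remain.

def flatten(cluster):
    """All leaf points under a merge-tree node."""
    if isinstance(cluster, tuple):
        return flatten(cluster[0]) + flatten(cluster[1])
    return [cluster]

def center(cluster):
    points = flatten(cluster)
    return sum(points) / len(points)

def agglomerate(D, k):
    D = list(D)
    while len(D) > k:
        # Find the closest pair (di, dj) under the similarity measure.
        pairs = [(abs(center(D[i]) - center(D[j])), i, j)
                 for i in range(len(D)) for j in range(i + 1, len(D))]
        _, i, j = min(pairs)
        di, dj = D[i], D[j]
        D = [d for idx, d in enumerate(D) if idx not in (i, j)]
        D.append((di, dj))  # the new d' is the "union" of di and dj
    return D

tree = agglomerate([1.0, 1.5, 8.0, 9.0], k=1)
print(tree)  # [((1.0, 1.5), (8.0, 9.0))]
```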
Advantages
Computation is simple and scalable.
No need to model the items themselves.
  Don't need an ontology, or even any idea of what the items are.
Performs better as more data is added.
Algorithmic Challenges
Curse of dimensionality.
Not all items are independent.
  We might want to learn weights for items, or combine items into larger groups.
What if a user is not similar to anyone?
Practical Challenges
How to get users to rate items?
How to get users to rate truthfully?
This approach tends to recommend popular items.
  They're likely to have been rated by lots of people.
What about new and unrated items?
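The Basic Approach slide can be sketched as follows; the rating dictionaries, user names, and the like-threshold are all illustrative assumptions:

```python
import math

# Each user is a vector of ratings (here a dict item -> rating). Find
# the closest other user by distance over commonly rated items, then
# suggest items that user liked but our user hasn't rated.

def user_distance(u, v):
    common = set(u) & set(v)
    if not common:
        return float("inf")  # no overlap: treat as maximally far apart
    return math.sqrt(sum((u[i] - v[i]) ** 2 for i in common))

def recommend(user, others, like_threshold=4):
    nearest = min(others, key=lambda v: user_distance(user, v))
    return sorted(item for item, rating in nearest.items()
                  if rating >= like_threshold and item not in user)

alice = {"matrix": 5, "amelie": 4}
bob   = {"matrix": 5, "amelie": 4, "alien": 5}
carol = {"matrix": 1, "amelie": 1, "clueless": 5}
print(recommend(alice, [bob, carol]))  # ['alien']
```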
Summary
Instance-based learning is a very effective approach to
dealing with large numeric data sets.
k-NN can be used in supervised settings.
In unsupervised settings, k-means is a simple and
effective choice.
Most recommender systems use a form of this
approach.