Introduction to Machine
Learning
Fall 2013
Comp3710 Artificial Intelligence
Computing Science
Thompson Rivers University
Course Outline

- Part I – Introduction to Artificial Intelligence
- Part II – Classical Artificial Intelligence
- Part III – Machine Learning
  - Introduction to Machine Learning
  - Neural Networks
  - Probabilistic Reasoning and Bayesian Belief Networks
  - Artificial Life: Learning through Emergent Behavior
- Part IV – Advanced Topics

TRU-COMP3710 – Intro to Machine Learning
Learning Objectives

- Define what classification is.
- List the three types of attributes, with examples.
- Summarize what concept learning is.
- Compute the information gain for an attribute from a given training data set.
- Construct a decision tree from a given training data set, using information gains.
- ...
Chapter Outline

1. Introduction
2. Training
3. Rote Learning
4. Learning Concepts
5. Inductive Bias
6. Decision-Tree Induction
7. The Problem of Overfitting
8. The Nearest Neighbor Algorithm
9. Learning Neural Networks
10. Supervised Learning
11. Reinforcement Learning
12. Unsupervised Learning
1. Introduction

- [Q] What is learning?
  Learning and intelligence are intimately related. It is usually agreed that a system capable of learning deserves to be called intelligent.
- [Q] Is it possible to make a machine learn?
  We will discuss the concept of learning and several learning methods.
- [Q] What are we going to make a machine learn?
2. Training

- [Q] How do we learn?
- Learning problems usually involve classifying inputs into a set of classes.
- Learning is only possible if there is a relationship between the data and the classifications.
- [Q] What kind of relationship? How do we classify?
  - Example: the people in this class. [Q] Can we classify ourselves?
  - Based on similarity.
- Training involves providing the system with data which has been manually classified.
- Learning systems use the training data to learn to classify unseen data.
Topics

- Three attribute types, based on how values can be compared:
  - The input data usually consist of multiple attributes.
  - The attributes are not all of the same type.
  - Numeric (numerical): e.g., 0, 1, .67, ...; both order and distance between two values are defined.
  - Ordinal: e.g., Small, Medium, Large, Extra Large; only order is defined.
  - Categorical (Boolean can be considered this type): e.g., English, Chinese, Spanish; not even order is defined.
- Example: the people in this class – numerical, ordinal, or categorical?
  - Age ?
  - Height ?
  - Weight ?
  - Mother tongue ?
  - Hair color ?
  - Number of legs ?
  - Number of arms ?
  - Number of noses ?
  - ...
- Is it really necessary to classify us?
3. Rote Learning

- Simply involves memorizing the classifications of the training data.
- A very simple learning method.
- Can only classify previously seen data – unseen data cannot be classified by a rote learner.
- [Q] Is this type of learning good enough? What if we have similar but unseen data?
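The memorization idea above can be sketched in a few lines of Python (a hypothetical illustration, not from the course; the class and attribute values are made up):

```python
# Minimal sketch of a rote learner: it memorizes exact (input -> class)
# pairs and can answer only for inputs it has already seen.

class RoteLearner:
    def __init__(self):
        self.memory = {}

    def train(self, example, label):
        # Memorize the classification of this exact training example.
        self.memory[example] = label

    def classify(self, example):
        # Unseen data cannot be classified; return None in that case.
        return self.memory.get(example)

learner = RoteLearner()
learner.train(("slow", "wind", "30ft"), "safe")
print(learner.classify(("slow", "wind", "30ft")))  # seen -> safe
print(learner.classify(("fast", "rain", "10ft")))  # unseen -> None
```

The `None` result for the second query is exactly the weakness the slide asks about: a rote learner has no notion of "similar" data.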
4. Concept Learning

- Concept learning involves determining a mapping from a set of input variables to a Boolean value.
- Such methods are known as inductive learning methods.
- If a function can be found which maps training data to correct classifications, then it will also work well for unseen data – hopefully! This process is known as generalization.
- A simple toy problem: determine whether driving in a particular manner in particular road conditions is safe or not.

  Attribute                          | Possible values                     | Type?
  -----------------------------------|-------------------------------------|------
  Speed                              | Slow, medium, fast                  | ?
  Weather                            | Wind, rain, snow, sun               | ?
  Distance from car in front         | 10ft, 20ft, 30ft, ...               | ?
  Units of alcohol driver has drunk  | 0, 1, 2, 3, 4, 5                    | ?
  Time of day                        | Morning, afternoon, evening, night  | ?
  Temperature                        | Cold, warm, hot                     | ?
Topics

- A hypothesis (or object) is a vector (or list) of attribute values:
  - h1 = <slow, wind, 30ft, 0, evening, cold>
  - h2 = <fast, rain, 10ft, 2, ?, ?>   (? means "we do not care", i.e., any value)
    This looks clearly unsafe. -> Negative training example
  - h3 = <fast, rain, 10ft, 2, ∅, ∅>   (∅ means no value)
- In concept learning, a training example is either positive or negative (true or false) (or belongs to one of multiple classes).
- Concept learning can be thought of as search through a space of hypotheses, where the goal is the hypothesis that most closely matches a given query.
- [Q] How to define "closely"?
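The "we do not care" marker can be made concrete with a small sketch (hypothetical code, using the road-safety attributes from the slide):

```python
# Sketch of hypothesis matching in concept learning: a '?' in a
# hypothesis accepts any value for that attribute position.

def matches(hypothesis, instance):
    # The instance is covered if every attribute either matches exactly
    # or the hypothesis does not care about it.
    return all(h == "?" or h == v for h, v in zip(hypothesis, instance))

h2 = ("fast", "rain", "10ft", "2", "?", "?")
x  = ("fast", "rain", "10ft", "2", "night", "cold")
print(matches(h2, x))  # True: h2 covers this instance
```

Since h2 is a negative training example, any instance it covers would be classified as unsafe under this hypothesis.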
5. Inductive Bias

- All learning methods have an inductive bias.
- Inductive bias refers to the restrictions imposed by the assumptions made in the learning method.
- E.g., assume the solution to the road-safety problem can be expressed as a conjunction of a set of six concepts.
  - This does not allow for more complex expressions that cannot be expressed as a conjunction. Therefore there are some potential solutions that we cannot explore.
  - However, without inductive bias, a learning method could not learn to generalize.
- Occam's razor is an example of an inductive bias: the best hypothesis to select is the simplest one.
  - h1 = <slow, wind, 30ft, 0, evening, cold>
  - h2 = <slow, wind, 30ft, ?, ?, ?>
  - Which one is better?
6. Decision Tree Induction

- Box-office success problem
- Training data set: [Q] How to obtain this kind of table?

  Film    | Country | Big Star | Genre   | Success
  --------|---------|----------|---------|--------
  Film 1  | USA     | Yes      | SF      | True
  Film 2  | USA     | No       | Comedy  | False
  Film 3  | USA     | Yes      | Comedy  | True
  Film 4  | Europe  | No       | Comedy  | True
  Film 5  | Europe  | Yes      | SF      | False
  Film 6  | Europe  | Yes      | Romance | False
  Film 7  | Other   | Yes      | Comedy  | False
  Film 8  | Other   | No       | SF      | False
  Film 9  | Europe  | Yes      | Comedy  | True
  Film 10 | USA     | Yes      | Comedy  | True

- Query: (Thailand, Yes, Romance) -> Success?
- [Q] How many comparisons when brute-force search is used? Let's try.
- In the box-office success problem, what must be the first question: Country, Big Star, or Genre? (See the training table above.)
- Country is a significant determinant of whether a film will be a success or not. Hence the first question is Country.
- What is the next question then? Using a tree?
- [Q] How to determine which attribute is the most significant determinant?
- A decision tree takes an input and gives a Boolean output (or a class).
- Box-office success problem: decision trees can represent more complex expressions, involving disjunctions and conjunctions, e.g.,
  ((Country = USA) ∧ (Big Star = Yes)) ∨ ((Country = Europe) ∧ (Genre = Comedy))
- All the objects in the training data set have classes – in this example, true and false.
- [Q] IF-THEN rules?
- [Q] Can we use decision tree induction as an expert system?
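The disjunctive expression on this slide reads off directly as IF-THEN rules; a minimal sketch (hypothetical function name, attribute values taken from the table):

```python
# The decision-tree expression above, written as two IF-THEN rules.
# A film is predicted to succeed if either rule fires.

def predict_success(country, big_star, genre):
    if country == "USA" and big_star == "Yes":
        return True
    if country == "Europe" and genre == "Comedy":
        return True
    return False

print(predict_success("USA", "Yes", "SF"))        # Film 1 -> True
print(predict_success("Other", "Yes", "Comedy"))  # Film 7 -> False
```

This is one answer to the "[Q] IF-THEN rules?" question: every root-to-leaf path in a decision tree is one such rule.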
- Decision tree induction involves creating a decision tree from a set of training data; the tree can then be used to correctly classify the training data.
- A query is given without a class, to find the class that fits the query best.
- [Q] How to create an efficient decision tree from a training data set?
- ID3, developed in the 1980s, is an example of a decision tree learning algorithm.
- ID3 builds the decision tree from the top down, selecting the attribute (also called feature) from the training data that provides the most information at each stage – that is, the most determinant attribute.
- [Q] In the box-office success problem, what must be the first question: Country, Big Star, or Genre? (See the training table above.)
- Country is a significant determinant of whether a film will be a success or not. Hence the first question is Country.
- [Q] How do we know what the most significant determinant is?
Topics

- ID3 selects attributes based on information gain.
- Information gain is the reduction in entropy caused by a decision.
- In information theory, entropy is a measure of the uncertainty associated with a random variable.
- Entropy of a training set S is defined as:
  H(S) = - p0 log2 p0 - p1 log2 p1 - ...   (if there are other classes)
  - p0 is the proportion of the training data that are positive (class 0) examples.
  - p1 is the proportion that are negative (class 1) examples.
- [Q] The most certain case?
  - The entropy of S is 0 when all the examples are positive, or when all the examples are negative.
- [Q] The most ambiguous case?
  - The entropy reaches its maximum value of 1 when exactly half of the examples are positive, and half are negative.
- [Q] The smaller, the better?
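The H(S) definition above can be checked with a short sketch (hypothetical helper, not from the course):

```python
import math

def entropy(labels):
    """H(S) = -sum of p_i * log2(p_i) over the classes present in labels."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

print(entropy(["T", "T", "T", "T"]))  # most certain case -> 0.0
print(entropy(["T", "T", "F", "F"]))  # most ambiguous case -> 1.0
```

The two calls reproduce the two boundary cases on the slide: a pure set has entropy 0, and an evenly split two-class set has entropy 1.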
- The information gain of a particular attribute tells us how closely the attribute represents the entire target function, and so at each stage the attribute that gives the highest information gain is chosen to become a question.
- [Q] How to obtain the information gain of a particular attribute?
  Gain = 1 – Σ (the weighted entropy of each value of the attribute)
  (The 1 here is H(S) of the whole training set, which happens to be exactly 1 for the table above; in general, Gain = H(S) – Σ weighted entropies.)
- Then 0 <= Gain <= 1.
- [Q] The information gain for Country? (See the training table above.)
- H(S) = - p0 log2 p0 - p1 log2 p1 = 1, since 5 of the 10 films are True and 5 are False.
- Gain = 1 – w-entropy(USA) – w-entropy(Europe) – w-entropy(Other)
- For USA, there are 4 films; 3 out of 4 are True.
  H(USA) = – (3/4) log2 (3/4) – (1/4) log2 (1/4) = .811
- Similarly (2/4 are True and 2/4 are False for Europe; 2/2 are False for Other):
  H(Europe) = 1; H(Other) = 0
- The weights for USA, Europe and Other are 4/10, 4/10 and 2/10.
- Gain = 1 – (4/10 × .811) – (4/10 × 1) – (2/10 × 0) = .2756
- [Q] The information gain for Big Star? (See the training table above.)
- For Yes, there are 7 films; 4 out of 7 are True.
  H(Yes) = – (4/7) log2 (4/7) – (3/7) log2 (3/7) = .985
- For No, there are 3 films; 1 out of 3 is True.
  H(No) = – (1/3) log2 (1/3) – (2/3) log2 (2/3) = .918
- The weights for Yes and No are 7/10 and 3/10.
- Gain = 1 – (7/10 × .985) – (3/10 × .918) = .035
- Gain for Genre = .17
- The information gain for Country = .2756
- The information gain for Big Star = .035
- The information gain for Genre = .17
- Therefore the attribute Country provides the greatest information gain, and so is placed at the top of the decision tree.
- This method is then applied recursively to the sub-branches of the tree.
- For example, for the USA branch we need to decide the most determinant attribute among Big Star and Genre, using only the USA rows:

  Film    | Country | Big Star | Genre  | Success
  --------|---------|----------|--------|--------
  Film 1  | USA     | Yes      | SF     | True
  Film 2  | USA     | No       | Comedy | False
  Film 3  | USA     | Yes      | Comedy | True
  Film 10 | USA     | Yes      | Comedy | True
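The recursive procedure described above can be sketched as a simplified ID3 (a hedged illustration: row/attribute names are transcribed from the table, everything else is my own and omits ID3's refinements such as handling unseen attribute values):

```python
import math
from collections import Counter

# The box-office training table: each row maps attribute -> value,
# plus the class under "Success".
ROWS = [
    {"Country": "USA",    "Big Star": "Yes", "Genre": "SF",      "Success": True},
    {"Country": "USA",    "Big Star": "No",  "Genre": "Comedy",  "Success": False},
    {"Country": "USA",    "Big Star": "Yes", "Genre": "Comedy",  "Success": True},
    {"Country": "Europe", "Big Star": "No",  "Genre": "Comedy",  "Success": True},
    {"Country": "Europe", "Big Star": "Yes", "Genre": "SF",      "Success": False},
    {"Country": "Europe", "Big Star": "Yes", "Genre": "Romance", "Success": False},
    {"Country": "Other",  "Big Star": "Yes", "Genre": "Comedy",  "Success": False},
    {"Country": "Other",  "Big Star": "No",  "Genre": "SF",      "Success": False},
    {"Country": "Europe", "Big Star": "Yes", "Genre": "Comedy",  "Success": True},
    {"Country": "USA",    "Big Star": "Yes", "Genre": "Comedy",  "Success": True},
]

def entropy(rows):
    n = len(rows)
    counts = Counter(r["Success"] for r in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(rows, attr):
    # H(S) minus the weighted entropy of each value's subset.
    g = entropy(rows)
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        g -= (len(subset) / len(rows)) * entropy(subset)
    return g

def id3(rows, attrs):
    labels = {r["Success"] for r in rows}
    if len(labels) == 1:   # pure branch -> leaf
        return labels.pop()
    if not attrs:          # no attributes left -> majority-class leaf
        return Counter(r["Success"] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    tree = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        tree[value] = id3(subset, [a for a in attrs if a != best])
    return (best, tree)

tree = id3(ROWS, ["Country", "Big Star", "Genre"])
print(tree[0])  # prints Country: the root question computed above
```

Running it picks Country at the root, makes the Other branch a False leaf immediately (both its films are False), and recurses on the USA and Europe branches exactly as the slides do by hand.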
- The case of Country = Other, after Country is selected:

  Film   | Country | Big Star | Genre  | Success
  -------|---------|----------|--------|--------
  Film 7 | Other   | Yes      | Comedy | False
  Film 8 | Other   | No       | SF     | False

- The information gain for Big Star:
  - For Yes, there is 1 film; 1 out of 1 is False. -> H(Yes) = 0
  - For No, there is 1 film; 1 out of 1 is False. -> H(No) = 0
  - Gain = 1 – (1/2 × 0) – (1/2 × 0) = 1
- Similarly, the information gain for Genre = 1.
- Hence there is no more branching in this case. [Q] The meaning is ???
  (Both films under Country = Other are False, so this branch is already pure: it becomes a False leaf, and no further question is needed.)
- This method is then applied recursively to the other sub-branches of the tree.
- [Q] No need to check Genre?
7. The Problem of Overfitting

- Noise can be present in the training data set.
- The training data may not adequately represent the entire space of possible data.
- Then, for example, a decision tree can perform poorly at classifying unseen data.
- This applies not only to decision trees, but also to other learning methods.
Topics

- [Figure: three scatter plots of the data with two hypothesis curves; one outlying point is labelled "noise".]
- Black dots represent positive examples, white dots negative.
- The two lines represent two different hypotheses.
- In the first diagram, there are just a few items of training data, correctly classified by the hypothesis represented by the darker line.
- In the second and third diagrams we see the complete set of data, and that the simpler hypothesis which matched the training data less well matches the rest of the data better than the more complex hypothesis, which overfits.
8. The Nearest Neighbor Algorithm

- [Q] Any other learning method?

  Film    | Country  | Big Star | Genre   | Success
  --------|----------|----------|---------|--------
  Film 1  | USA      | Yes      | SF      | True
  Film 2  | USA      | No       | Comedy  | False
  Film 3  | USA      | Yes      | Comedy  | True
  Film 4  | Europe   | No       | Comedy  | True
  Film 5  | Europe   | Yes      | SF      | False
  Film 6  | Europe   | Yes      | Romance | False
  Film 7  | Other    | Yes      | Comedy  | False
  Film 8  | Other    | No       | SF      | False
  Film 9  | Europe   | Yes      | Comedy  | True
  Film 10 | USA      | Yes      | Comedy  | True
  Query   | Thailand | Yes      | Romance | ???

- The Nearest Neighbor algorithm is an example of instance-based learning.
- Instance-based learning involves storing training data and using it to attempt to classify new data as it arrives.
- The nearest neighbor algorithm works with data that consist of vectors of numerical attributes.
- Each vector represents a point in n-dimensional space.
- When an unseen data item is to be classified, the Euclidean distance is calculated between this item and all training data.
- For example, the distance between <x1, y1> and <x2, y2> is:
  d = sqrt((x1 – x2)^2 + (y1 – y2)^2)
- [Q] How to classify?
Topics

- [Q] How to classify when k = 3?
- The k-nearest neighbor algorithm:
  - The classification for the unseen data is usually selected as the one that is most common amongst the k nearest neighbors.
- Shepard's method:
  - This involves allowing all training data to contribute to the classification, with each contribution weighted in inverse proportion to its distance from the data item to be classified: for each class, sum 1/di over its members, and choose the class with the largest sum.
- ...
- Advantage: unlike decision tree learning, the nearest neighbor algorithm performs very well with noisy input data.
- Disadvantage: [Q] But what if the training data set is huge? Any good idea?
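Both voting schemes above can be sketched in a few lines (toy two-dimensional data of my own; function names are assumptions, not from the course):

```python
import math
from collections import Counter

# Sketch of k-nearest-neighbor classification on numeric vectors,
# plus Shepard-style inverse-distance (1/d) voting.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

TRAIN = [((1.0, 1.0), "pos"), ((1.2, 0.8), "pos"),
         ((4.0, 4.0), "neg"), ((4.2, 3.9), "neg"), ((3.8, 4.1), "neg")]

def knn(query, k=3):
    # Majority class among the k nearest training points.
    nearest = sorted(TRAIN, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def shepard(query):
    # Every training point votes for its class with weight 1/d.
    votes = Counter()
    for point, label in TRAIN:
        d = euclidean(point, query)
        votes[label] += 1 / d if d > 0 else float("inf")
    return votes.most_common(1)[0][0]

print(knn((1.1, 1.0)))      # pos: 2 of the 3 nearest neighbors are pos
print(shepard((1.1, 1.0)))  # pos: the two close points carry large weights
```

With k = 3 the majority vote and the distance-weighted vote agree here; they can disagree when a query sits between a tight cluster and a diffuse one.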
9. Neural Networks

- An artificial neural network is a network of artificial neurons, which is based on the operation of the human brain.
- Neural networks usually have their nodes arranged in layers.
- One layer is the input layer, and another is the output layer. There are one or more hidden layers between these two.
- [Figure: a biological neuron – dendrites, cell body (nucleus), axon, synapse.]
Topics

- The connections between nodes have weights associated with them, which determine the behavior of the network.
- Input data is applied to the input layer.
- Neurons fire if their inputs are above a certain level.
- If one neuron is connected to another, the firing of one may cause the firing of the next.
- In the next unit, we will discuss neural networks in more detail.
10. Supervised Learning

- Two types of learning: supervised and unsupervised.
- Supervised learning learns by being presented with a pre-classified training data set.
- [Q] Is decision tree induction supervised learning?
- [Q] Is the k-nearest neighbor algorithm supervised learning?
- Many neural networks use supervised learning.
  - Pre-classified training data is provided to the network before it is presented with unseen data.
  - The training data causes the weights in the network to be set to levels such that unseen data can be classified correctly.
- Neural networks are able to learn to classify extremely complex functions.
- Demo: http://www.cbu.edu/~pong/ai/hopfield/hopfieldapplet.html
- [Q] Can positive/negative feedback information be used in training?
11. Reinforcement Learning

- Systems that learn using reinforcement learning are given positive feedback when they classify data correctly, and negative feedback when they classify data incorrectly.
- Credit assignment is needed to reward the responsible nodes in a network correctly.
12. Unsupervised Learning

- Unsupervised learning learns without any pre-classified training data set.
- Unsupervised learning networks learn without requiring human intervention: no labelled training data is required.
- [Q] How is that possible?
- The system learns to cluster input data into a set of classes that are not previously defined. This is called clustering.
- Clustering is a basic tool for data mining and pattern recognition.
- Examples: Fuzzy C-Means, EM, Kohonen Maps.
- ...
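None of the listed algorithms are developed in these slides, but the core idea of discovering classes without labels can be sketched with k-means, a simpler relative of Fuzzy C-Means and EM (everything below is my own illustration, not course material):

```python
import random

# Minimal 1-D k-means sketch: cluster unlabelled numbers into k groups
# by alternating an assignment step and a center-update step.

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]  # two obvious groups, no labels
print(kmeans(data, k=2))               # centers near 1.0 and 5.0
```

No example is ever told which group it belongs to; the two classes emerge from the data alone, which is exactly the point of the slide.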