Reading: Chap. 7
START OF DAY 3
Instance-based Learning
Introduction
• Instance-based learning is often termed lazy learning, as
there is typically no “transformation” of training
instances into more general “statements”
• Instead, the presented training data is simply stored and,
when a new query instance is encountered, a set of
similar, related instances is retrieved from memory and
used to classify the new query instance
• Hence, instance-based learners never form an explicit
general hypothesis regarding the target function. They
simply compute the classification of each new query
instance as needed
k-NN Approach
• The simplest, most used instance-based
learning algorithm is the k-NN algorithm
• k-NN assumes that all instances are points in
some n-dimensional space ℝⁿ and defines
neighbors in terms of distance (usually
Euclidean)
• k is the number of neighbors considered
k-NN Algorithm
• For each training instance t=(x, f(x))
– Add t to the set Tr_instances
• Given a query instance q to be classified
– Let x1, …, xk be the k training instances in Tr_instances nearest to q
– Return
$\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$
• Where V is the finite set of target class values, and δ(a,b)=1 if
a=b, and 0 otherwise (Kronecker delta)
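The procedure above can be sketched in a few lines of Python (an illustrative sketch only; the function and variable names are not from the slides):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance in n-dimensional space
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(tr_instances, q, k=3):
    """tr_instances is a list of (x, f_x) pairs; q is the query point."""
    # Keep the k stored training instances nearest to q
    neighbors = sorted(tr_instances, key=lambda t: euclidean(t[0], q))[:k]
    # Majority vote over the neighbors' labels (the delta sum above)
    votes = Counter(f_x for _, f_x in neighbors)
    return votes.most_common(1)[0][0]

# Example: two classes in 2-D
data = [((1, 1), "+"), ((1, 2), "+"), ((5, 5), "-"), ((6, 5), "-")]
print(knn_classify(data, (2, 2), k=3))  # "+"
```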
Bias
• Intuitively, the k-NN algorithm assigns to each
new query instance the majority class among
its k nearest neighbors
– Things that look the same ought to be labeled the
same
• In practice, k is usually chosen to be odd, so as
to avoid ties
• The k = 1 rule is generally called the nearest-neighbor classification rule
Decision Surface (1-NN)
Properties:
1) All possible points within a sample's Voronoi cell are the nearest neighboring
points for that sample
2) For any sample, the nearest sample is determined by the closest Voronoi cell
edge
[Figures: Voronoi decision surfaces under Euclidean distance and Manhattan distance]
Impact of the Value of k
q is + under 1-NN, but – under 5-NN
Distance-weighted k-NN
• Replace $\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$ by:
$\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \frac{1}{d(x_i, x_q)^2}\, \delta(v, f(x_i))$
Scale Effects
• Different features may have different
measurement scales
– E.g., patient weight in kg (range [50,200]) vs.
blood protein values in ng/dL (range [-3,3])
• Consequences
– Patient weight will have a much greater influence
on the distance between samples
– May bias the performance of the classifier
Use normalization or standardization
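For example, a minimal sketch of the two usual fixes (the sample values are made up for illustration):

```python
def standardize(column):
    # z-score standardization: zero mean, unit variance
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    return [(v - mean) / var ** 0.5 for v in column]

def min_max_normalize(column):
    # Rescale to [0, 1]
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

weights_kg = [50, 80, 120, 200]      # wide range
protein = [-1.2, 0.3, 2.5, -3.0]     # narrow range
print(standardize(weights_kg))
print(min_max_normalize(protein))
```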
Predicting Continuous Values
• Replace $\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i\, \delta(v, f(x_i))$ by:
$\hat{f}(q) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
• Where f(x) is the output value of instance x
Regression Example
[Figure: query point xq with three nearest neighbors whose output values are 8, 5, and 3]
• What is the value of the new instance?
• Assume dist(xq, n8) = 2, dist(xq, n5) = 3, dist(xq, n3) = 4
• f(xq) = (8/2² + 5/3² + 3/4²) / (1/2² + 1/3² + 1/4²) = 2.74/0.42 ≈ 6.5
• The denominator renormalizes the value
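The same computation, written as a small Python sketch using the inverse-square weights from the distance-weighted rule (the helper name is illustrative):

```python
def weighted_knn_regression(neighbors):
    """neighbors: list of (distance_to_query, target_value) pairs."""
    # Inverse-square distance weights, as in the distance-weighted rule above
    weights = [1.0 / d ** 2 for d, _ in neighbors]
    numerator = sum(w * v for w, (_, v) in zip(weights, neighbors))
    return numerator / sum(weights)

# (distance, value) pairs from the example: (2, 8), (3, 5), (4, 3)
print(round(weighted_knn_regression([(2, 8), (3, 5), (4, 3)]), 2))  # 6.48, i.e. ~6.5
```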
Some Remarks
• k-NN works well on many practical problems
and is fairly noise tolerant (depending on the
value of k)
• k-NN is subject to the curse of dimensionality
and to the presence of many irrelevant attributes
• k-NN relies on efficient indexing
– Could also reduce the number of instances (e.g., drop an
instance if it would still be classified correctly without it)
• What distance?
Distances
These work great for continuous attributes
How about nominal attributes?
How about mixtures of continuous and nominal attributes?
Distance for Nominal Attributes
Distance for Heterogeneous Data
Wilson, D. R. and Martinez, T. R., Improved Heterogeneous Distance Functions, Journal of
Artificial Intelligence Research, vol. 6, no. 1, pp. 1-34, 1997
Incremental Learning
A learning algorithm is incremental if and only if,
for any given training samples, e1, … , en, it
produces a sequence of hypotheses (or models),
h0, h1, …, hn, such that hi+1 depends only on hi
and the current example ei
How is k-NN Incremental?
• All training instances are stored
• Model consists of the set of training instances
• Adding a new training instance only affects
the computation of neighbors, which is done
at execution time (i.e., lazily)
• Note that the storing of training instances is a violation
of the strict definition of incremental learning.
Naïve Bayes
Bayesian Learning
• A powerful and growing approach in machine
learning
• We use Bayesian reasoning in our own
decision-making all the time
– You hear a word which could equally be “Thanks”
or “Tanks”, which would you go with?
• Combine data likelihood and your prior knowledge
– Texting suggestions on phone
– Spell checkers, etc.
Bayesian Reasoning
• Bayesian reasoning provides a
probabilistic approach to inference. It
is based on the assumption that the
quantities of interest are governed by
probability distributions (priors) and
that optimal decisions can be made
by reasoning about these probabilities
together with observed data
(likelihood)
Example (I)
• Suppose I wish to know whether someone is
telling the truth or lying about some issue X
– The available data is from a lie detector with two
possible outcomes: truthful or liar
– I also have prior knowledge that over the entire
population, 21% of people lie about X
– Finally, I know that the lie detector is imperfect: it
returns truthful in only 94% of the cases where
people actually told the truth and liar in only 87%
of the cases where people were actually lying
Example (II)
• P(lies about X) = 0.21
• P(tells the truth about X) = 0.79
• P(truthful | tells the truth about X) =
0.94
• P(liar | tells the truth about X) = 0.06
• P(liar | lies about X) = 0.87
• P(truthful | lies about X) = 0.13
Example (III)
• Suppose a new person is asked about
X and the lie detector returns liar
• Should we conclude the person is
indeed lying about X or not?
• What we need is to compare:
– P(lies about X | liar)
– P(tells the truth about X | liar)
Example (IV)
• P(A|B) = P(AB) / P(B), by definition
• P(B|A) = P(AB) / P(A), by definition
• Combining the above two: P(B|A) = P(A|B)P(B) / P(A)
• In our case:
– P(lies about X | liar) =
[P(liar | lies about X)·P(lies about X)] / P(liar)
– P(tells the truth about X | liar) =
[P(liar | tells the truth about X)·P(tells the truth about X)] / P(liar)
Example (V)
• All of the above probabilities we have,
except for P(liar)
• However, by the Law of Total Probability,
– P(A) = P(A|B)P(B) + P(A|~B)P(~B)
• In our case:
– P(liar) = P(liar | lies about X)·P(lies about X) +
P(liar | tells the truth about X)·P(tells the truth about X)
• And again, we know all of these probabilities
Example (VI)
• Computing, we get:
– P(liar) = 0.87×0.21 + 0.06×0.79 = 0.23
– P(lies about X | liar) = [0.87×0.21]/0.23 = 0.79
– P(tells the truth about X | liar) = [0.06×0.79]/0.23 = 0.21
• And we would conclude that the
person was indeed lying about X
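The whole calculation fits in a few lines of Python (variable names are illustrative):

```python
# Prior and conditional probabilities from the example
p_lies = 0.21
p_truth = 0.79
p_liar_given_lies = 0.87
p_liar_given_truth = 0.06

# Law of Total Probability: P(liar)
p_liar = p_liar_given_lies * p_lies + p_liar_given_truth * p_truth

# Bayes theorem: posteriors given that the detector said "liar"
p_lies_given_liar = p_liar_given_lies * p_lies / p_liar
p_truth_given_liar = p_liar_given_truth * p_truth / p_liar

print(round(p_liar, 2), round(p_lies_given_liar, 2), round(p_truth_given_liar, 2))
# 0.23 0.79 0.21
```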
Bayesian Learning
• Assume we have two (random) variables, X and C
– X represents multidimensional objects and C is their class label
– E.g., C = PlayTennis = {Yes, No} and X = Outlook × Temperature × Humidity × Wind
• We can define the following quantities:
– P(C) – Prior probability
What we believe is true about C before we look at any specific instance of X
In this case, what would P(C=Yes) and P(C=No) be?
– P(X|C) – Likelihood
How likely a particular instance X is given that we know its label C
In this case, what would P(X=<Rain,*,*,*>|C=No) be?
– P(C|X) – Posterior probability
What we believe is true about C for a specific instance X
In this case, what would P(C=Yes|X=<Overcast,Mild,High,Weak>) be?
This is what we are interested in!
How can we find it?
Finding P(C|X)
• We use Bayes Theorem
P(C|X) = P(X|C)P(C) / P(X)
• Allows us to talk about probabilities/beliefs even when
there is little data, because we can use the prior
– What is the probability of a nuclear plant meltdown?
– What is the probability that BYU will win the national
championship?
• As the amount of data increases, Bayes shifts
confidence from the prior to the likelihood
• Requires reasonable priors in order to be helpful
• We use priors all the time in our decision making
Remember bias: one form of prior
Probabilistic Learning
• In ML, we are often interested in determining
the best hypothesis from some space H,
given the observed training data D.
• One way to specify what is meant by the
best hypothesis is to say that we demand
the most probable hypothesis, given the
data D together with any initial knowledge
about the prior probabilities of the various
hypotheses in H.
Bayes Theorem
• Bayes theorem is the cornerstone of Bayesian
learning methods
• It provides a way of calculating the posterior
probability P(h | D), from the prior probabilities P(h),
P(D) and P(D | h), as follows:
$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$
MAP Learning
• How do we make our decision?
– We choose the/a maximally probable or
maximum a posteriori (MAP) hypothesis, namely:
$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)$
Example of MAP Hypothesis
• Assume only 3 possible hypotheses in H
• Given a data set D which h do we choose?
H    Likelihood P(D|h)   Prior P(h)   Relative Posterior P(D|h)P(h)
h1   .6                  .3           .18
h2   .9                  .2           .18
h3   .7                  .5           .35

(Not normalized by P(D). Note the interaction between prior and likelihood.)
• Non-bayesian? Probably select h2
• Bayesian? Select h3 in a principled way
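A tiny sketch of the MAP choice on this table (hypothetical variable names; P(D) is dropped since it does not affect the argmax):

```python
# Likelihoods P(D|h) and priors P(h) from the table above
hypotheses = {"h1": (0.6, 0.3), "h2": (0.9, 0.2), "h3": (0.7, 0.5)}

# Relative posteriors P(D|h)P(h); P(D) is a constant, so it can be ignored for the argmax
rel_post = {h: round(lik * prior, 2) for h, (lik, prior) in hypotheses.items()}

print(rel_post)                         # {'h1': 0.18, 'h2': 0.18, 'h3': 0.35}
print(max(rel_post, key=rel_post.get))  # h3 (the MAP hypothesis)
```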
Remarks
• Brute-Force MAP learning algorithm:
– Often impractical (H too large)
– Answers the question: which is the most probable
hypothesis given the training data?
• Often, more significant question:
– Which is the most probable classification of the
new query instance given the training data?
• The most probable classification of a new
instance can be obtained by combining the
predictions of all hypotheses, weighted by
their posterior probabilities
Bayes Optimal Classification (I)
• If the possible classification of the new instance can
take on any value cj from some set C, then the
probability P(cj | D) that the correct classification for
the new instance is cj , is just:
$P(c_j \mid D) = \sum_{h_i \in H} P(c_j \mid h_i)\, P(h_i \mid D)$   (Law of Total Probability)
Clearly, the optimal classification of the new instance is the
value cj, for which P(cj | D) is maximum, which gives rise to the
following algorithm to classify query instances.
Bayes Optimal Classification (II)
• Return $\arg\max_{c_j \in C} \sum_{h_i \in H} P(c_j \mid h_i)\, P(h_i \mid D)$
• No other classification method using the same
hypothesis space and same prior knowledge can
outperform this method on average:
– It maximizes the probability that the new instance is
classified correctly, given the available data, hypothesis
space and prior probabilities over the hypotheses
Example of Bayes Optimal Classification (I)
• Assume same 3 hypotheses with priors and posteriors as shown for a
data set D with 2 possible output classes (A and B)
• Assume novel input instance x where h1 and h2 output B and h3
outputs A for x
H     Likelihood P(D|h)   Prior P(h)   Posterior P(D|h)P(h)   P(A|D)        P(B|D)
h1    .6                  .3           .18                    0×.18 = 0     1×.18 = .18
h2    .9                  .2           .18                    0×.18 = 0     1×.18 = .18
h3    .7                  .5           .35                    1×.35 = .35   0×.35 = 0
Sum                                                           .35           .36

The Bayes optimal classification is therefore B (.36 > .35), even though the MAP hypothesis h3 predicts A.
Example of Bayes Optimal Classification (II)
• Assume probabilistic outputs from the hypotheses:

H    P(A|h)   P(B|h)
h1   .3       .7
h2   .4       .6
h3   .9       .1

H     Likelihood P(D|h)   Prior P(h)   Posterior P(D|h)P(h)   P(A|D)           P(B|D)
h1    .6                  .3           .18                    .3×.18 = .054    .7×.18 = .126
h2    .9                  .2           .18                    .4×.18 = .072    .6×.18 = .108
h3    .7                  .5           .35                    .9×.35 = .315    .1×.35 = .035
Sum                                                           .441             .269
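A short sketch of the Bayes optimal computation on these numbers (names are illustrative; the posteriors are the unnormalized P(D|h)P(h) values from the table):

```python
# Unnormalized posteriors P(D|h)P(h) and per-hypothesis class probabilities
posteriors = {"h1": 0.18, "h2": 0.18, "h3": 0.35}
class_probs = {"h1": {"A": 0.3, "B": 0.7},
               "h2": {"A": 0.4, "B": 0.6},
               "h3": {"A": 0.9, "B": 0.1}}

# P(c|D) is proportional to the sum over hypotheses of P(c|h) * P(h|D)
scores = {c: sum(class_probs[h][c] * posteriors[h] for h in posteriors)
          for c in ("A", "B")}

print(scores)                       # {'A': 0.441, 'B': 0.269} (up to float rounding)
print(max(scores, key=scores.get))  # A
```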
Naïve Bayes Learning (I)
• A large or infinite H still makes the above algorithm impractical
• Naive Bayes learning is a practical Bayesian learning method:
– Applies directly to learning tasks where instances are conjunction of
attribute values and the target function takes its values from some
finite set C
– The Bayesian approach consists in assigning to a new query instance
the most probable target value, cMAP, given the attribute values a1, …,
an that describe the instance, i.e.,
$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid a_1, \ldots, a_n)$
Naïve Bayes Learning (II)
• Using Bayes theorem, this can be reformulated as:
$c_{MAP} = \arg\max_{c_j \in C} \frac{P(a_1, \ldots, a_n \mid c_j)\, P(c_j)}{P(a_1, \ldots, a_n)} = \arg\max_{c_j \in C} P(a_1, \ldots, a_n \mid c_j)\, P(c_j)$
• Finally, we make the further simplifying assumption that:
– Attribute values are conditionally independent given the target value
• Hence, one can write the conjunctive conditional probability
as a product of simple conditional probabilities
Naïve Bayes Learning (III)
• Return $\arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{n} P(a_i \mid c_j)$
• The naive Bayes learning method involves a learning step in which
the various P(cj) and P(ai | cj) terms are estimated:
– Based on their frequencies over the training data
• These estimates are then used in the above formula to classify
each new query instance
• Whenever the assumption of conditional independence is
satisfied, the naive Bayes classification is identical to the MAP
classification
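A minimal frequency-counting sketch of this learning/classification step, assuming nominal attributes and no smoothing (function names are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def train_nb(instances):
    """instances: list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter()             # n_c for each class c
    attr_counts = defaultdict(Counter)   # (attr_index, value) -> per-class counts
    for attrs, c in instances:
        class_counts[c] += 1
        for i, a in enumerate(attrs):
            attr_counts[(i, a)][c] += 1
    return class_counts, attr_counts

def classify_nb(class_counts, attr_counts, query):
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        # P(c) * product over attributes of P(a_i | c), estimated from frequencies
        score = nc / n
        for i, a in enumerate(query):
            score *= attr_counts[(i, a)][c] / nc
        if score > best_score:
            best, best_score = c, score
    return best, best_score
```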
NB Example (I)
Risk Assessment for Loan Applications
Client #   Credit History   Debt Level   Collateral   Income Level   RISK LEVEL
1          Bad              High         None         Low            HIGH
2          Unknown          High         None         Medium         HIGH
3          Unknown          Low          None         Medium         MODERATE
4          Unknown          Low          None         Low            HIGH
5          Unknown          Low          None         High           LOW
6          Unknown          Low          Adequate     High           LOW
7          Bad              Low          None         Low            HIGH
8          Bad              Low          Adequate     High           MODERATE
9          Good             Low          None         High           LOW
10         Good             High         Adequate     High           LOW
11         Good             High         None         Low            HIGH
12         Good             High         None         Medium         MODERATE
13         Good             High         None         High           LOW
14         Bad              High         None         Medium         HIGH
NB Example (II)
P(Credit History | Risk Level):
Risk Level   Unknown   Bad    Good
Moderate     0.33      0.33   0.33
High         0.33      0.50   0.17
Low          0.40      0.00   0.60

P(Debt Level | Risk Level):
Risk Level   High   Low
Moderate     0.33   0.67
High         0.67   0.33
Low          0.40   0.60

P(Collateral | Risk Level):
Risk Level   None   Adequate
Moderate     0.67   0.33
High         1.00   0.00
Low          0.60   0.40

P(Income Level | Risk Level):
Risk Level   High   Medium   Low
Moderate     0.33   0.67     0.00
High         0.00   0.33     0.67
Low          1.00   0.00     0.00

P(Risk Level):
Risk Level   Count   Probability
Moderate     3       0.21
High         6       0.43
Low          5       0.36
Total        14      1.00

Consider the query instance: (Bad, Low, Adequate, Medium)
High: 0.00%   Moderate: 1.06%   Low: 0.00%
Prediction: Moderate

Consider the query instance: (Bad, High, None, Low) – Seen (client 1)
High: 9.52%   Moderate: 0.00%   Low: 0.00%
Prediction: High
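The first query can be checked with a short sketch that plugs in the table values (exact fractions rather than the rounded entries); the second query works the same way:

```python
import math

# Conditional probabilities from the tables above, for the query (Bad, Low, Adequate, Medium)
priors = {"High": 6/14, "Moderate": 3/14, "Low": 5/14}
likelihoods = {
    # P(Bad|c), P(Debt=Low|c), P(Collateral=Adequate|c), P(Income=Medium|c)
    "High":     [3/6, 2/6, 0/6, 2/6],
    "Moderate": [1/3, 2/3, 1/3, 2/3],
    "Low":      [0/5, 3/5, 2/5, 0/5],
}

scores = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}
for c, s in scores.items():
    print(f"{c}: {100 * s:.2f}%")                  # High: 0.00%, Moderate: 1.06%, Low: 0.00%
print("Prediction:", max(scores, key=scores.get))  # Moderate
```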
Continuous Attributes
• Can discretize into bins thus changing it
into a nominal feature and then gather
statistics normally
– How many bins? - More bins is good, but need
sufficient data to make statistically significant
bins. Thus, base it on data available
• Could also assume the data is Gaussian and
compute the mean and variance for each
feature given the output class; then each
P(ai|cj) becomes $\mathcal{N}(a_i \mid \mu_{i,c_j}, \sigma^2_{i,c_j})$, where $\mu_{i,c_j}$ and
$\sigma^2_{i,c_j}$ are the mean and variance of attribute i over
the training instances of class cj
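A hedged sketch of the Gaussian option (the sample values are invented; in practice a small variance floor is often added to avoid division by zero):

```python
import math

def gaussian_pdf(a, mu, var):
    # N(a | mu, var): Gaussian likelihood for a continuous attribute value
    return math.exp(-(a - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gaussian_params(values):
    # Per-class mean and variance for one continuous attribute
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

# E.g., attribute values of the training instances belonging to one class
mu, var = gaussian_params([4.9, 5.1, 5.4, 5.0])
print(gaussian_pdf(5.2, mu, var))  # P(a_i | c_j) under the Gaussian assumption
```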
Estimating Probabilities
• We have so far estimated P(X=x | Y=y) by the fraction
nx|y/ny, where ny is the number of instances for which
Y=y and nx|y is the number of these for which X=x
• This is a problem when nx|y is small
– E.g., assume P(X=x | Y=y)=0.05 and the training set is such
that ny=5. Then it is highly probable that nx|y=0
– The fraction is thus an underestimate of the actual
probability
– It will dominate the Bayes classifier for all new queries
with X=x (i.e., drive probabilities to 0)
Laplacian or m-estimate
• Replace $\frac{n_{x|y}}{n_y}$ by:
$\frac{n_{x|y} + mp}{n_y + m}$   (Laplacian: m = 1/p)
• Where p is our prior estimate of the probability we
wish to determine and m is a constant
– Typically, p = 1/k (where k is the number of possible
values of X, i.e., attribute values in our case)
– m acts as a weight (similar to adding m virtual instances
distributed according to p)
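A small sketch of the m-estimate (parameter names are illustrative):

```python
def m_estimate(n_xy, n_y, k, m=1.0):
    """Smoothed estimate of P(X=x | Y=y).

    n_xy: count of instances with X=x and Y=y
    n_y:  count of instances with Y=y
    k:    number of possible values of X (so the prior p = 1/k)
    m:    equivalent sample size (weight given to the prior)
    """
    p = 1.0 / k
    return (n_xy + m * p) / (n_y + m)

# Without smoothing this would be 0/5 = 0; with the m-estimate it stays non-zero
print(m_estimate(n_xy=0, n_y=5, k=3))  # (0 + 1/3) / 6 ≈ 0.056
```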
How is NB Incremental?
• No training instances are stored
• Model consists of summary statistics that are
sufficient to compute prediction
• Adding a new training instance only affects
summary statistics, which may be updated
incrementally
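A minimal sketch of such an incremental update, assuming the nominal-attribute counting scheme sketched earlier (names are illustrative):

```python
from collections import Counter, defaultdict

# Summary statistics: class counts and per-(attribute, value) class counts
class_counts = Counter()
attr_counts = defaultdict(Counter)

def update(attrs, c):
    # Adding one training instance only increments counts: O(#attributes) work,
    # no stored instances and no retraining pass
    class_counts[c] += 1
    for i, a in enumerate(attrs):
        attr_counts[(i, a)][c] += 1

update(("Bad", "High", "None", "Low"), "HIGH")  # e.g., client 1 from the table
```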
Revisiting Conditional Independence
• Definition: X is conditionally independent of Y given
Z iff P(X | Y, Z) = P(X | Z)
• NB assumes that all attributes are conditionally
independent, given the class. Hence,
$P(A_1, \ldots, A_n \mid V) = P(A_1 \mid A_2, \ldots, A_n, V)\, P(A_2, \ldots, A_n \mid V)$
$= P(A_1 \mid V)\, P(A_2 \mid A_3, \ldots, A_n, V)\, P(A_3, \ldots, A_n \mid V)$
$= P(A_1 \mid V)\, P(A_2 \mid V)\, P(A_3 \mid A_4, \ldots, A_n, V)\, P(A_4, \ldots, A_n \mid V)$
$= \cdots$
$= \prod_{i=1}^{n} P(A_i \mid V)$
What If?
• In many cases, the NB assumption is
overly restrictive
• What we need is a way of handling
independence or dependence over
subsets of attributes
– Joint probability distribution
• Defined over Y1 x Y2 x … x Yn
• Specifies the probability of each variable
binding
Bayesian Belief Network
• Directed acyclic graph:
– Nodes represent variables in the joint space
– Arcs represent the assertion that the variable is
conditionally independent of its non-descendants
in the network given its immediate
predecessors in the network
– A conditional probability table is also given for
each variable: P(V | immediate predecessors)
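A toy sketch of how such a network could be represented and used to read off a joint probability; the two-variable structure and all numbers here are invented purely for illustration:

```python
# Toy network: Rain -> WetGrass (structure and probabilities invented for illustration)
parents = {"Rain": (), "WetGrass": ("Rain",)}
cpt = {
    "Rain":     {(): {True: 0.2, False: 0.8}},      # P(Rain)
    "WetGrass": {(True,): {True: 0.9, False: 0.1},  # P(WetGrass | Rain)
                 (False,): {True: 0.1, False: 0.9}},
}

def joint(assignment):
    # Joint probability = product over variables of P(V | immediate predecessors)
    p = 1.0
    for var, val in assignment.items():
        parent_vals = tuple(assignment[par] for par in parents[var])
        p *= cpt[var][parent_vals][val]
    return p

print(joint({"Rain": True, "WetGrass": True}))  # 0.2 * 0.9 = 0.18
```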
BN Examples
Homework: Weka & Incremental Learning
END OF DAY 3