 Bayesian Classification
 Other Classification Methods
 Classification Accuracy
 Summary
 Probabilistic learning: Calculate explicit probabilities for a hypothesis; among the most practical approaches to certain types of learning problems
 Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
 Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
 Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

 Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes theorem:
P(h|D) = P(D|h)·P(h) / P(D)
 MAP (maximum a posteriori) hypothesis:
h_MAP = arg max_{h∈H} P(h|D) = arg max_{h∈H} P(D|h)·P(h)
 Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
 The classification problem may be formalized using a-posteriori probabilities:
 P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C
e.g. P(class=N | outlook=sunny, windy=true, …)
 Idea: assign to sample X the class label C such that P(C|X) is maximal
 Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
 P(X) is constant for all classes
 P(C) = relative freq. of class C samples
 Choose C such that P(C|X) is maximum = choose C such that P(X|C)·P(C) is maximum
 Problem: computing P(X|C) directly is infeasible, since it would require estimating the joint distribution of all attribute values within each class
 A simplified assumption: attributes are conditionally independent:
P(Cj|V) ∝ P(Cj) · ∏_{i=1..n} P(vi|Cj)
 Greatly reduces the computation cost; only count the class distribution.
 Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
 If the i-th attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
 If the i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
 Computationally easy in both cases (see the sketch below)
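As a concrete illustration of these two estimates, here is a minimal Python sketch (not from the slides); the function names and the data layout are illustrative assumptions:

    # Minimal sketch of estimating P(xi|C) for naive Bayes:
    # relative frequencies for categorical attributes, Gaussian density for continuous ones.
    import math
    from collections import Counter, defaultdict

    def fit_categorical(rows, attr_index, labels):
        """P(xi|C) as the relative frequency of value xi among samples of class C."""
        counts = defaultdict(Counter)          # class -> attribute value -> count
        class_totals = Counter(labels)         # class -> number of samples
        for row, label in zip(rows, labels):
            counts[label][row[attr_index]] += 1
        return {c: {v: n / class_totals[c] for v, n in vals.items()}
                for c, vals in counts.items()}

    def gaussian_density(x, mean, std):
        """P(xi|C) for a continuous attribute via a Gaussian density."""
        return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)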
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

P(p) = 9/14
P(n) = 5/14
outlook
P(sunny|p) = 2/9        P(sunny|n) = 3/5
P(overcast|p) = 4/9     P(overcast|n) = 0
P(rain|p) = 3/9         P(rain|n) = 2/5
temperature
P(hot|p) = 2/9          P(hot|n) = 2/5
P(mild|p) = 4/9         P(mild|n) = 2/5
P(cool|p) = 3/9         P(cool|n) = 1/5
humidity
P(high|p) = 3/9         P(high|n) = 4/5
P(normal|p) = 6/9       P(normal|n) = 1/5
windy
P(true|p) = 3/9         P(true|n) = 3/5
P(false|p) = 6/9        P(false|n) = 2/5
 An unseen sample X = <rain, hot, high, false>
 P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
 P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
 Sample X is classified as class n (don't play tennis), since 0.018286 > 0.010582
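The arithmetic above can be checked with a few lines of Python that hard-code the probabilities read off the tables (a minimal sketch, not a general classifier):

    # Reproduce the play-tennis calculation for X = <rain, hot, high, false>
    # using the class priors and conditional probabilities estimated above.
    p_p = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # P(X|p)*P(p) ~= 0.010582
    p_n = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # P(X|n)*P(n) ~= 0.018286
    print("predicted class:", "p" if p_p > p_n else "n")   # -> n (don't play tennis)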
 Bayesian Classification
 Other Classification Methods
 Classification Accuracy
 Summary
 Advantages (of neural network classification)
 prediction accuracy is generally high
 robust, works when training examples contain errors
 output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
 fast evaluation of the learned target function
 Criticisms
 long training time (model construction time)
 difficult to understand the learned function (weights)
 not easy to incorporate domain knowledge
X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

Output Y is 1 if at least two of the three inputs are equal to 1.
The same input-output table can be reproduced by a simple linear threshold model:
Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 - 0.4 > 0)
where I(z) = 1 if the condition z is true, and 0 otherwise.
 Model is an assembly of inter-connected nodes and weighted links
 Output node sums up each of its input values according to the weights of its links
 Compare the output node against some threshold t
Perceptron Model:
Y = I(Σi wi·Xi - t > 0)   or   Y = sign(Σi wi·Xi - t)
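To make the decision rule concrete, here is a minimal Python sketch of the perceptron output using the example weights (0.3, 0.3, 0.3) and threshold 0.4 from the earlier slide; the function name is an illustrative assumption:

    # Perceptron decision rule Y = I(sum_i w_i*X_i - t > 0),
    # with the example weights w = (0.3, 0.3, 0.3) and threshold t = 0.4.
    def perceptron_predict(x, w=(0.3, 0.3, 0.3), t=0.4):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) - t > 0 else 0

    # Reproduces "Y is 1 if at least two of the three inputs are equal to 1":
    for x in [(1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1)]:
        print(x, "->", perceptron_predict(x))    # 0, 1, 1, 1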
 Initialize the weights (w0, w1, …, wk)
 Adjust the weights in such a way that the output of the NN is consistent with the class labels of the training examples
 This can be done by optimizing the objective function
E = Σ_{i∈D} (Yi - Ŷi)²
where Yi and Ŷi are the actual class label and the predicted class value, respectively, of the i-th training example in data set D
 The most well-known algorithm is the backpropagation algorithm (a gradient-descent sketch follows)
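As an illustration of minimizing E by gradient descent, here is a minimal sketch for a single sigmoid unit; this is not the full backpropagation algorithm, and the learning rate, epoch count, and sigmoid output are illustrative assumptions:

    # Gradient-descent weight updates for one sigmoid unit,
    # minimizing E = sum_i (Y_i - Yhat_i)^2 over the training set D.
    import math

    def train(D, k, lr=0.5, epochs=1000):
        w = [0.0] * (k + 1)                        # w[0] is the bias weight
        for _ in range(epochs):
            for x, y in D:                         # x is a k-tuple, y is 0 or 1
                z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
                yhat = 1.0 / (1.0 + math.exp(-z))  # sigmoid output (predicted class value)
                grad = -2.0 * (y - yhat) * yhat * (1.0 - yhat)   # dE_i/dz
                w[0] -= lr * grad
                for j, xj in enumerate(x):
                    w[j + 1] -= lr * grad * xj
        return w

For instance, it could be applied to the eight-row example above with k=3.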
 Several methods for association-based classification
 ARCS: Quantitative association mining and clustering of association rules (Lent et al'97)
 It beats C4.5 in (mainly) scalability and also accuracy
 Associative classification (Liu et al'98)
 It mines high support and high confidence rules in the form of "cond_set => y", where y is a class label (see the sketch after this list)
 CAEP: Classification by aggregating emerging patterns (Dong et al'99)
 Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
 Mine EPs based on minimum support and growth rate
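As an illustration of how "cond_set => y" rules can be used for prediction, here is a minimal sketch; the rule representation, the highest-confidence tie-breaking, and the example rules with their confidences are illustrative assumptions, not the actual ARCS or associative-classification algorithms:

    # Classify a tuple with association rules of the form "cond_set => y":
    # pick the highest-confidence rule whose condition set is satisfied by the tuple.
    def classify_with_rules(rules, tuple_items, default_class):
        """rules: list of (cond_set, class_label, confidence); tuple_items: set of items."""
        matching = [r for r in rules if r[0] <= tuple_items]      # cond_set is a subset
        if not matching:
            return default_class
        return max(matching, key=lambda r: r[2])[1]               # highest confidence wins

    # Hypothetical rules in the style of the weather data above:
    rules = [({"outlook=overcast"}, "P", 1.0),
             ({"outlook=sunny", "humidity=high"}, "N", 0.8)]
    print(classify_with_rules(rules, {"outlook=sunny", "humidity=high", "windy=true"}, "P"))  # -> N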
 Instance-based classification:
 Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
 Typical approaches:
 k-nearest neighbor approach: instances represented as points in a Euclidean space
 Case-based reasoning: uses symbolic representations and knowledge-based inference
[Figure: a set of stored cases with attributes Atr1 … AtrN and a Class label, and an unseen case to be classified]
• Store the training records
• Use training records to predict the class label of unseen cases
 Basic idea: If it walks like a duck, quacks like a duck, then it's probably a duck
[Figure: compute the distance from the test record to the training records, then choose the k "nearest" records]
[Figure: an unknown record plotted among the training records]
 Requires three things
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
 To classify an unknown record (see the sketch below):
– Compute the distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
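A minimal sketch of this procedure, assuming Euclidean distance and a simple majority vote (the function names and data layout are illustrative):

    # k-NN: Euclidean distance, pick the k nearest training records,
    # and take a majority vote over their class labels.
    import math
    from collections import Counter

    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def knn_classify(training, unknown, k=3):
        """training is a list of (attribute_tuple, class_label) pairs."""
        neighbors = sorted(training, key=lambda rec: euclidean(rec[0], unknown))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]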
 Instance-based learning: lazy evaluation
 Decision-tree: eager evaluation
 Key differences
 Lazy methods may consider the query instance xq when deciding how to generalize beyond the training data D
 Eager methods cannot, since they have already chosen a global approximation by the time they see the query
 Efficiency: lazy - less time training but more time predicting
 Accuracy
 Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
 Eager: must commit to a single hypothesis that covers the entire instance space
 Bayesian Classification
 Other Classification Methods
 Classification Accuracy
 Summary
 Partition: training-and-testing
 use two independent data sets, e.g., training set (2/3), test set (1/3)
 used for data sets with a large number of samples
 Cross-validation
 divide the data set into k subsamples
 use k-1 subsamples as training data and one subsample as test data --- k-fold cross-validation (see the sketch below)
 for data sets of moderate size
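A minimal sketch of k-fold cross-validation, assuming a generic train/predict interface (the function names are illustrative):

    # k-fold cross-validation: split the data into k subsamples, train on k-1 of them,
    # test on the remaining one, and average the accuracy. Assumes len(data) >= k.
    def k_fold_cross_validation(data, train_fn, predict_fn, k=10):
        """data: list of (attributes, label) pairs; train_fn(train) returns a model,
        predict_fn(model, attributes) returns a predicted label."""
        folds = [data[i::k] for i in range(k)]          # k roughly equal subsamples
        accuracies = []
        for i in range(k):
            test = folds[i]
            train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
            model = train_fn(train)
            correct = sum(1 for x, y in test if predict_fn(model, x) == y)
            accuracies.append(correct / len(test))
        return sum(accuracies) / k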
 Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
 Classification is probably one of the most widely used data mining techniques, with a lot of extensions
 Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic
 Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.