Classification
COMP 790-90 Seminar
BCB 713 Module
Spring 2011
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Bayesian Classification: Why?
Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
Prior knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted
by their probabilities
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
P(X): probability that sample data is observed
P(X|H) : probability of observing the sample X, given that
the hypothesis holds
Bayesian Theorem
Given training data X, the a posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem:
P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be written as
posterior = likelihood x prior / evidence
MAP (maximum a posteriori) hypothesis:
h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h)
Practical difficulty: requires initial knowledge of many probabilities and significant computational cost
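To make the rule and the MAP decision on the slide above concrete, here is a minimal numeric sketch in Python; the two hypotheses and all probability values are made-up illustration numbers, not part of the lecture.

```python
# Minimal numeric sketch of Bayes' rule and the MAP decision.
# The two hypotheses and their probabilities are hypothetical illustration values.

priors = {"h1": 0.7, "h2": 0.3}        # P(h)
likelihoods = {"h1": 0.2, "h2": 0.9}   # P(D | h) for the observed data D

# Evidence: P(D) = sum over h of P(D | h) * P(h)
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior: P(h | D) = P(D | h) * P(h) / P(D)   (likelihood x prior / evidence)
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}

# MAP hypothesis: argmax_h P(h | D), equivalently argmax_h P(D | h) * P(h)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

print(posteriors)  # {'h1': 0.341..., 'h2': 0.658...}
print(h_map)       # 'h2'
```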
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent:
P(X | Ci) = ∏_{k=1}^{n} P(x_k | Ci)
The probability of occurrence of, say, two elements x1 and x2, given the current class C, is the product of the probabilities of each element taken separately, given the same class: P([y1, y2] | C) = P(y1 | C) * P(y2 | C)
No dependence relation between attributes
Greatly reduces the computation cost: only count the class distribution
Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci) (a small sketch follows this slide)
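A minimal counting-based sketch of this classifier in Python, assuming categorical attributes and no smoothing of zero counts; the function and variable names are illustrative rather than taken from the lecture.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Ci) and P(attribute=value | Ci) by counting.
    rows: list of dicts mapping attribute name -> value; labels: one class per row."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)  # (class, attribute) -> Counter over values
    for row, c in zip(rows, labels):
        for attr, val in row.items():
            cond_counts[(c, attr)][val] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    def cond_prob(attr, val, c):
        return cond_counts[(c, attr)][val] / class_counts[c]
    return priors, cond_prob

def classify(x, priors, cond_prob):
    """Assign x (dict of attribute -> value) to the class with maximum P(X|Ci) * P(Ci)."""
    def score(c):
        p = priors[c]
        for attr, val in x.items():
            p *= cond_prob(attr, val, c)   # conditional-independence assumption
        return p
    return max(priors, key=score)
```

The buys_computer table on the next slide can be fed to train_naive_bayes directly; a worked run on that data follows the example slide.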
Training dataset
Class: C1: buys_computer = 'yes', C2: buys_computer = 'no'
Data sample X = (age<=30, income=medium, student=yes, credit_rating=fair)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci): P(X|buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci): P(X|buys_computer="yes") * P(buys_computer="yes") = 0.028
P(X|buys_computer="no") * P(buys_computer="no") = 0.007
X belongs to class "buys_computer=yes"
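A short Python check of the numbers above, reusing the buys_computer table from the training-dataset slide; the tuple layout and helper name are choices made only for this sketch.

```python
# Rows of the training table: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")  # the sample X from the slide

def score(cls):
    rows = [r for r in data if r[4] == cls]
    prior = len(rows) / len(data)                       # P(Ci)
    likelihood = 1.0
    for k, val in enumerate(x):                         # product of P(x_k | Ci)
        likelihood *= sum(r[k] == val for r in rows) / len(rows)
    return likelihood * prior                           # P(X|Ci) * P(Ci)

print(score("yes"), score("no"))  # ~0.028 and ~0.007, so X -> buys_computer = "yes"
```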
Naïve Bayesian Classifier:
Comments
Advantages:
Easy to implement
Good results obtained in most of the cases
Disadvantages:
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.; Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.
Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
How to deal with these dependencies?
Bayesian Belief Networks
Bayesian Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of joint probability distribution
[Figure: a directed graph with nodes X, Y, Z, and P]
Nodes: random variables
Links: dependency
X,Y are the parents of Z, and Y is the
parent of P
No dependency between Z and P
Has no loops or cycles
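As a concrete reading of the last point, the toy graph above (X and Y are the parents of Z, Y is the parent of P) specifies the joint distribution through the factorization used on the next slide:

P(X, Y, Z, P) = P(X) P(Y) P(Z | X, Y) P(P | Y)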
Bayesian Belief Network: An Example
[Figure: a belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory (FH) and Smoker (S) are the parents of LungCancer (LC)]
The conditional probability table for the variable LungCancer shows the conditional probability for each possible combination of its parents:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

P(z_1, ..., z_n) = ∏_{i=1}^{n} P(z_i | Parents(Z_i))
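A small Python sketch of how the LungCancer CPT above could be stored and used; the priors for FamilyHistory and Smoker are hypothetical placeholders, since the slide only gives the LungCancer table.

```python
# CPT for LungCancer, indexed by the truth values of its parents (FamilyHistory, Smoker).
# The four probabilities come from the table on this slide.
p_lc_given_parents = {
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

# Hypothetical priors for the parent nodes (NOT given on the slide; illustration only).
p_fh = 0.1   # assumed P(FamilyHistory = true)
p_s = 0.3    # assumed P(Smoker = true)

def prob(value, p_true):
    """P(node = value) for a Boolean node with P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# Joint probability of one assignment via P(z1,...,zn) = prod_i P(zi | Parents(Zi)),
# restricted here to the three variables FamilyHistory, Smoker, LungCancer.
def joint(fh, s, lc):
    p_lc = p_lc_given_parents[(fh, s)]
    return prob(fh, p_fh) * prob(s, p_s) * (p_lc if lc else 1.0 - p_lc)

print(joint(True, True, True))   # 0.1 * 0.3 * 0.8 = 0.024
```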
Learning Bayesian Networks
Several cases
Given both the network structure and all variables
observable: learn only the CPTs
Network structure known, some hidden variables:
method of gradient descent, analogous to neural
network learning
Network structure unknown, all variables observable:
search through the model space to reconstruct graph
topology
Unknown structure, all hidden variables: no good
algorithms known for this purpose
D. Heckerman, Bayesian networks for data mining
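For the first case above (structure known, all variables observable), a minimal sketch of estimating a CPT by counting; the data layout and the sample observations are assumptions made only for illustration.

```python
from collections import Counter

def estimate_cpt(samples, child, parents):
    """Estimate P(child | parents) from fully observed samples (dicts of variable -> value)."""
    joint = Counter()     # counts of (parent values, child value)
    marginal = Counter()  # counts of parent values alone
    for s in samples:
        key = tuple(s[p] for p in parents)
        joint[(key, s[child])] += 1
        marginal[key] += 1
    return {k: n / marginal[k[0]] for k, n in joint.items()}

# Hypothetical observations of (FamilyHistory, Smoker, LungCancer), illustration only.
samples = [
    {"FH": True, "S": True, "LC": True},
    {"FH": True, "S": True, "LC": False},
    {"FH": False, "S": False, "LC": False},
    {"FH": False, "S": True, "LC": True},
]
print(estimate_cpt(samples, child="LC", parents=["FH", "S"]))
# e.g. {((True, True), True): 0.5, ((True, True), False): 0.5, ...}
```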
SVM – Support Vector Machines
[Figure: separating hyperplanes with a small margin and a large margin; the support vectors lie on the margin boundaries]
SVM – Cont.
Linear Support Vector Machine
Given a set of points x_i ∈ R^n with labels y_i ∈ {-1, 1},
the SVM finds a hyperplane defined by the pair (w, b)
(where w is the normal to the plane and b is the distance from the origin)
s.t. y_i (x_i · w + b) ≥ 1, i = 1, ..., n
x – feature vector, b – bias, y – class label, 2/||w|| – margin
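A tiny numpy sketch of the constraint y_i(x_i · w + b) ≥ 1 and the margin 2/||w|| from the slide above; the points and the hyperplane (w, b) are hand-picked illustration values, not from the lecture.

```python
import numpy as np

# Toy 2-D points with labels in {-1, +1}, and a hand-chosen hyperplane (w, b).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])   # normal to the separating hyperplane (illustrative)
b = 0.0

# The linear SVM constraint: y_i * (x_i . w + b) >= 1 for every training point.
constraints_hold = np.all(y * (X @ w + b) >= 1)

# Margin of this hyperplane: 2 / ||w||.
margin = 2.0 / np.linalg.norm(w)

print(constraints_hold)  # True for this toy example
print(margin)            # 2 / sqrt(0.5) ≈ 2.83
```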
SVM – Cont.
What if the data is not linearly separable?
Project the data to a high dimensional space where it is linearly separable, and then we can use a linear SVM (using kernels)
[Figure: 1-D points at -1, 0, +1 labeled +, -, + are mapped to the 2-D points (0,1), (0,0), (1,0), where the two classes become linearly separable]
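A short sketch of the projection idea in the figure above; the specific feature map is an assumption chosen only to reproduce the 1-D to 2-D picture.

```python
# 1-D points that are not linearly separable: class + at -1 and +1, class - at 0.
points = [-1.0, 0.0, 1.0]
labels = ["+", "-", "+"]

def phi(x):
    """Hand-picked map to 2-D matching the figure: -1 -> (0, 1), 0 -> (0, 0), +1 -> (1, 0)."""
    return (max(x, 0.0), max(-x, 0.0))

mapped = [phi(x) for x in points]
print(list(zip(mapped, labels)))
# [((0.0, 1.0), '+'), ((0.0, 0.0), '-'), ((1.0, 0.0), '+')]
# In this 2-D space the line u + v = 0.5 separates the classes, so a linear SVM applies.
```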
Non-Linear SVM
Classification using SVM (w, b): test whether x_i · w + b ≥ 0
In the non-linear case we can see this as testing whether K(x_i, w) + b ≥ 0
Kernel – can be thought of as doing a dot product in some high dimensional space
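A small numpy check of the statement that a kernel can be thought of as a dot product in some high dimensional space, using a degree-2 polynomial kernel and its explicit feature map as an illustrative choice.

```python
import numpy as np

def phi(x):
    """Explicit feature map for 2-D input x = (x1, x2): (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def K(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# The kernel computed in the original 2-D space equals the ordinary dot product
# taken after mapping both points into the higher dimensional feature space.
print(K(x, z))                        # (1*3 + 2*0.5)^2 = 16.0
print(float(np.dot(phi(x), phi(z))))  # also 16.0
```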
Results
SVM Related Links
http://svm.dcs.rhbnc.ac.uk/
http://www.kernel-machines.org/
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
SVMlight – Software (in C): http://ais.gmd.de/~thorsten/svm_light
BOOK: An Introduction to Support Vector Machines. N. Cristianini and J. Shawe-Taylor. Cambridge University Press.