Bayesian Classification and Neural Networks
CS490D (Chris Clifton), lecture slide transcript, 1/27/2012
Bayesian Classification: Why?
● Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
● Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
● Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
● Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
● Let X be a data sample whose class label is unknown
● Let H be a hypothesis that X belongs to class C
● For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
● P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data; reflects the background knowledge)
● P(X): probability that the sample data is observed
● P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayes' Theorem
● Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = P(X|H) P(H) / P(X)
● Informally, this can be written as: posterior = likelihood x prior / evidence
● MAP (maximum a posteriori) hypothesis:
  h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)
● Practical difficulty: requires initial knowledge of many probabilities, and has significant computational cost
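As a quick illustration of the rule above, here is a minimal Python sketch that evaluates posteriors and picks the MAP hypothesis. The two hypotheses and their prior/likelihood numbers are made up for the example; only the formulas come from the slide.

def posterior(prior, likelihood, evidence):
    """P(H|X) = P(X|H) * P(H) / P(X)"""
    return likelihood * prior / evidence

# Two hypothetical hypotheses with priors P(h) and likelihoods P(D|h)
hypotheses = {
    "h1": {"prior": 0.3, "likelihood": 0.8},
    "h2": {"prior": 0.7, "likelihood": 0.4},
}

# Evidence P(D) = sum over hypotheses of P(D|h) * P(h)
evidence = sum(v["likelihood"] * v["prior"] for v in hypotheses.values())

# MAP hypothesis: argmax_h P(D|h) * P(h)  (the evidence can be ignored for the argmax)
h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])

for h, v in hypotheses.items():
    print(h, round(posterior(v["prior"], v["likelihood"], evidence), 3))
print("MAP hypothesis:", h_map)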
Naïve Bayes Classifier
● A simplifying assumption: attributes are conditionally independent given the class:
  P(X|Ci) = product over k = 1..n of P(x_k|Ci)
● The probability of observing, say, two attribute values y1 and y2 together, given that the class is C, is the product of the probabilities of each value taken separately, given the same class:
  P([y1, y2] | C) = P(y1 | C) * P(y2 | C)
● No dependence relation between attributes is assumed
● Greatly reduces the computation cost: only the class distributions have to be counted
● Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci)
Training dataset
● Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'
● Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Naïve Bayesian Classifier: Example
● Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
● For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
● P(X|Ci) * P(Ci):
  P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.044 x 9/14 = 0.028
  P(X | buys_computer = "no") * P(buys_computer = "no") = 0.019 x 5/14 = 0.007
● X belongs to class "buys_computer = yes"
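The calculation above can be reproduced directly from the training table. The following Python sketch is our own illustration (the helper name cond_prob is not from the slides); it retypes the table, counts the conditional probabilities, and scores both classes. It should print likelihoods of roughly 0.044 and 0.019 and scores of roughly 0.028 and 0.007, matching the slide.

from collections import Counter

data = [
    # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]

# Prior class counts: P(Ci) = count(Ci) / N
class_counts = Counter(row[-1] for row in data)

def cond_prob(attr, value, cls):
    """P(attr = value | buys_computer = cls), estimated by counting."""
    i = attrs.index(attr)
    in_class = [row for row in data if row[-1] == cls]
    return sum(1 for row in in_class if row[i] == value) / len(in_class)

# Classify the unseen sample X = (age<=30, income=medium, student=yes, credit_rating=fair)
X = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
for cls in class_counts:
    likelihood = 1.0
    for attr, value in X.items():
        likelihood *= cond_prob(attr, value, cls)      # P(X|Ci) under independence
    score = likelihood * class_counts[cls] / len(data)  # P(X|Ci) * P(Ci)
    print(cls, round(likelihood, 3), round(score, 3))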
Naïve Bayesian Classifier: Comments
● Advantages:
  - Easy to implement
  - Good results obtained in most of the cases
● Disadvantages:
  - Assumption of class conditional independence, therefore loss of accuracy
  - Practically, dependencies exist among variables
  - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
  - Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
● How to deal with these dependencies? Bayesian Belief Networks

Bayesian Networks
● A Bayesian belief network allows a subset of the variables to be conditionally independent
● A graphical model of causal relationships:
  - Represents dependency among the variables
  - Gives a specification of the joint probability distribution
● Example graph with nodes X, Y, Z, P (figure):
  - Nodes: random variables
  - Links: dependency
  - X and Y are the parents of Z, and Y is the parent of P
  - No dependency between Z and P
  - The graph has no loops or cycles

Bayesian Belief Network: An Example
● The network (figure) relates the variables Family History, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; the conditional probability table below is for LungCancer, whose parents are Family History (FH) and Smoker (S)
● The conditional probability table for the variable LungCancer shows the conditional probability for each possible combination of its parents:

         (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC       0.8       0.5        0.7         0.1
  ~LC      0.2       0.5        0.3         0.9

● The joint probability of an assignment (z1, ..., zn) to the variables Z1, ..., Zn factors over the network:
  P(z1, ..., zn) = product over i = 1..n of P(zi | Parents(Zi))
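A minimal sketch of how the LungCancer table above is used. Only P(LungCancer | FH, S) is taken from the slide; a full network would have one such table per node, so the priors for FamilyHistory and Smoker below are illustrative placeholders, not values from the slide.

# P(LungCancer = LC | FamilyHistory, Smoker), indexed by (fh, smoker)
cpt_lung_cancer = {
    (True, True): 0.8,    # (FH, S)
    (True, False): 0.5,   # (FH, ~S)
    (False, True): 0.7,   # (~FH, S)
    (False, False): 0.1,  # (~FH, ~S)
}

def p_lung_cancer(lc, fh, smoker):
    """P(LungCancer = lc | FH = fh, Smoker = smoker) read from the CPT."""
    p_true = cpt_lung_cancer[(fh, smoker)]
    return p_true if lc else 1.0 - p_true

# Joint factorization P(z1,...,zn) = prod_i P(zi | Parents(Zi)), e.g.
# P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S).
p_fh, p_smoker = 0.1, 0.3   # assumed priors, for illustration only
joint = p_fh * p_smoker * p_lung_cancer(True, True, True)
print(joint)  # P(FH = true, Smoker = true, LungCancer = true) under the assumed priors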
Learning Bayesian Networks
● Several cases:
  - Given both the network structure and all variables observable: learn only the CPTs
  - Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
  - Unknown structure, all hidden variables: no good algorithms known for this purpose
● Reference: D. Heckerman, Bayesian networks for data mining

Neural Networks
● Analogy to biological systems (indeed a great example of a good learning system)
● Massive parallelism, allowing for computational efficiency
● The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce these outputs using the perceptron learning rule
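A minimal sketch of the perceptron learning rule just described, for a single neuron with a threshold activation. The toy data (an AND-style problem), learning rate, and number of passes are made up for the illustration.

def predict(weights, bias, x):
    """Threshold activation: fire if the weighted sum reaches the bias."""
    s = sum(w * xi for w, xi in zip(weights, x)) - bias
    return 1 if s >= 0 else 0

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # inputs, target (AND)
weights, bias, rate = [0.0, 0.0], 0.0, 0.1

for _ in range(20):                       # a few passes over the training data
    for x, target in data:
        error = target - predict(weights, bias, x)
        # Perceptron rule: nudge the weights toward producing the target output
        weights = [w + rate * error * xi for w, xi in zip(weights, x)]
        bias = bias - rate * error        # the bias moves opposite to the weights

print(weights, bias, [predict(weights, bias, x) for x, _ in data])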
A Neuron
● Figure: inputs x_0, x_1, ..., x_n with weights w_0, w_1, ..., w_n and bias -μ_k feed a weighted sum, which passes through an activation function f to produce the output y (input vector x, weight vector w, weighted sum, activation function)
● The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping

A Neuron: For Example
● With the sign function as activation:
  y = sign(sum over i = 0..n of w_i x_i - μ_k)

Network Training
● The ultimate objective of training:
  - obtain a set of weights that makes almost all the tuples in the training data classified correctly
● Steps:
  - Initialize the weights with random values
  - Feed the input tuples into the network one by one
  - For each unit:
    - Compute the net input to the unit as a linear combination of all the inputs to the unit
    - Compute the output value using the activation function
    - Compute the error
    - Update the weights and the bias
Multi-Layer Perceptron
● Figure: the input vector x_i feeds the input nodes, which connect through weights w_ij to the hidden nodes and then the output nodes, producing the output vector
● Net input and output of unit j:
  I_j = sum over i of w_ij O_i + θ_j
  O_j = 1 / (1 + e^(-I_j))
● Error at an output node j (with target value T_j):
  Err_j = O_j (1 - O_j)(T_j - O_j)
● Error at a hidden node j (backpropagated from the next layer):
  Err_j = O_j (1 - O_j) sum over k of Err_k w_jk
● Weight and bias updates (with learning rate l):
  w_ij = w_ij + (l) Err_j O_i
  θ_j = θ_j + (l) Err_j

[The following slides are drawn from Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6), by I. H. Witten, E. Frank and M. A. Hall.]

Multilayer perceptrons
● Using kernels is only one way to build a nonlinear classifier based on perceptrons
● Can create a network of perceptrons to approximate arbitrary target concepts
● A multilayer perceptron is an example of an artificial neural network
  - Consists of: input layer, hidden layer(s), and output layer
● The structure of the MLP is usually found by experimentation
● The parameters can be found using backpropagation
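The update formulas on the Multi-Layer Perceptron slide above translate almost line for line into code. Below is a minimal sketch for one hidden layer and a single output unit; the variable names follow the slide (I_j, O_j, Err_j, θ_j), while the network size, the XOR-style toy data, the learning rate, the epoch count, and the random seed are our own illustrative choices.

import math, random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny network: 2 inputs -> 2 hidden units -> 1 output unit.
n_in, n_hidden = 2, 2
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
theta_hidden = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
theta_out = random.uniform(-0.5, 0.5)

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR-style toy data
l = 0.5  # learning rate

for _ in range(5000):
    for x, target in data:
        # Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)
        o_hidden = [sigmoid(sum(w_hidden[j][i] * x[i] for i in range(n_in)) + theta_hidden[j])
                    for j in range(n_hidden)]
        o_out = sigmoid(sum(w_out[j] * o_hidden[j] for j in range(n_hidden)) + theta_out)

        # Backward pass: Err_j = O_j(1 - O_j)(T_j - O_j) at the output,
        # Err_j = O_j(1 - O_j) * sum_k Err_k w_jk at a hidden node
        err_out = o_out * (1 - o_out) * (target - o_out)
        err_hidden = [o_hidden[j] * (1 - o_hidden[j]) * err_out * w_out[j]
                      for j in range(n_hidden)]

        # Updates: w_ij += l * Err_j * O_i, theta_j += l * Err_j
        for j in range(n_hidden):
            w_out[j] += l * err_out * o_hidden[j]
            for i in range(n_in):
                w_hidden[j][i] += l * err_hidden[j] * x[i]
            theta_hidden[j] += l * err_hidden[j]
        theta_out += l * err_out

# Inspect the fitted outputs (how well XOR is fit depends on the random start)
for x, target in data:
    o_hidden = [sigmoid(sum(w_hidden[j][i] * x[i] for i in range(n_in)) + theta_hidden[j])
                for j in range(n_hidden)]
    o_out = sigmoid(sum(w_out[j] * o_hidden[j] for j in range(n_hidden)) + theta_out)
    print(x, target, round(o_out, 2))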
Backpropagation
● How to learn the weights given a network structure?
  - Cannot simply use the perceptron learning rule because we have hidden layer(s)
  - The function we are trying to minimize: the error
  - Can use a general function minimization technique called gradient descent
● Need a differentiable activation function: use the sigmoid function instead of the threshold function
  f(x) = 1 / (1 + exp(-x))
● Need a differentiable error function: can't use zero-one loss, but can use the squared error
  E = 1/2 (y - f(x))^2

Gradient descent example
● Function: x^2 + 1
● Derivative: 2x
● Learning rate: 0.1
● Start value: 4
● Can only find a local minimum!
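A minimal sketch of the example above: repeatedly step against the derivative 2x with learning rate 0.1, starting from 4. The step count is our own choice.

def f(x):
    return x ** 2 + 1

def df(x):
    return 2 * x          # derivative of x^2 + 1

x, rate = 4.0, 0.1        # start value and learning rate from the slide
for step in range(20):
    x = x - rate * df(x)  # move against the gradient
    print(step, round(x, 4), round(f(x), 4))
# x shrinks by a factor of 0.8 each step, approaching the minimum at x = 0

For a convex function like this one there is only a single minimum; the slide's caveat about local minima matters for the non-convex error surfaces of neural networks.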
The two activation functions
● (Figure: the threshold (step) function and the sigmoid function f(x) = 1/(1 + exp(-x)))

Minimizing the error I
● Need to find the partial derivative of the error function with respect to each parameter (i.e. weight)
● With output f(x), where x = sum over i of w_i f(x_i) is the weighted sum of the hidden-unit outputs:
  dE/dw_i = (y - f(x)) df(x)/dw_i
  df(x)/dw_i = f'(x) dx/dw_i = f'(x) f(x_i)
  dE/dw_i = (y - f(x)) f'(x) f(x_i)

Minimizing the error II
● What about the weights for the connections from the input to the hidden layer?
  dE/dw_ij = (dE/dx)(dx/dw_ij) = (y - f(x)) f'(x) dx/dw_ij
  x = sum over i of w_i f(x_i), so dx/dw_ij = w_i df(x_i)/dw_ij
  df(x_i)/dw_ij = f'(x_i) dx_i/dw_ij = f'(x_i) a_i
  dE/dw_ij = (y - f(x)) f'(x) w_i f'(x_i) a_i
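A small sketch contrasting the two activation functions and the squared error used above; the function names and sample points are ours.

import math

def threshold(x):
    """Step (threshold) activation: 0 or 1, not differentiable at 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """Sigmoid activation f(x) = 1/(1 + exp(-x)): smooth and differentiable."""
    return 1.0 / (1.0 + math.exp(-x))

def squared_error(y, fx):
    """E = 1/2 (y - f(x))^2, the differentiable error used for backpropagation."""
    return 0.5 * (y - fx) ** 2

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, threshold(x), round(sigmoid(x), 3))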
Remarks
● The same process works for multiple hidden layers and multiple output units (e.g. for multiple classes)
● Can update the weights after all training instances have been processed, or incrementally:
  - batch learning vs. stochastic backpropagation
● Weights are initialized to small random values
● How to avoid overfitting?
  - Early stopping: use a validation set to check when to stop
  - Weight decay: add a penalty term to the error function
● How to speed up learning?
  - Momentum: re-use a proportion of the old weight change
  - Use an optimization method that employs the 2nd derivative

Radial basis function networks
● Another type of feedforward network, with two layers (plus the input layer)
● Hidden units represent points in instance space, and their activation depends on distance
  - To this end, distance is converted into similarity: Gaussian activation function
  - The width may be different for each hidden unit
  - Points of equal activation form a hypersphere (or hyperellipsoid), as opposed to a hyperplane
● Output layer same as in an MLP

Learning RBF networks
● Parameters: centers and widths of the RBFs + weights in the output layer
● Can learn the two sets of parameters independently and still get accurate models
  - E.g.: clusters from k-means can be used to form the basis functions
  - A linear model can then be used based on the fixed RBFs
  - Makes learning RBF networks very efficient
● Disadvantage: no built-in attribute weighting based on relevance
● RBF networks are related to RBF SVMs
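A minimal sketch of the Gaussian hidden-unit activation described above. The centers, widths, and output weights are fixed, made-up values rather than learned ones; in practice k-means and a linear model would supply them, as the slides note.

import math

# Made-up RBF network for 1-D inputs: two Gaussian hidden units plus a linear output.
centers = [0.0, 2.0]        # hidden-unit centers (points in instance space)
widths = [1.0, 0.5]         # per-unit widths
out_weights = [1.5, -0.5]   # output-layer weights
out_bias = 0.2

def rbf_activation(x, c, w):
    """Gaussian similarity: close to 1 near the center, decaying with distance."""
    return math.exp(-((x - c) ** 2) / (2 * w ** 2))

def predict(x):
    hidden = [rbf_activation(x, c, w) for c, w in zip(centers, widths)]
    return sum(wo * h for wo, h in zip(out_weights, hidden)) + out_bias

for x in (0.0, 1.0, 2.0, 3.0):
    print(x, round(predict(x), 3))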
Stochastic gradient descent
● Have seen gradient descent + stochastic backpropagation for learning the weights in a neural network
● Gradient descent is a general-purpose optimization technique
  - Can be applied whenever the objective function is differentiable
  - Actually, it can be used even when the objective function is not completely differentiable!
  - Subgradients
● One application: learning linear models, e.g. linear SVMs or logistic regression

Stochastic gradient descent cont.
● Learning linear models using gradient descent is easier than optimizing a nonlinear NN
  - The objective function has a global minimum rather than many local minima
● Stochastic gradient descent is fast, uses little memory, and is suitable for incremental online learning

Stochastic gradient descent cont.
● For SVMs, the error function (to be minimized) is called the hinge loss
Stochastic gradient descent cont.
● In the linearly separable case, the hinge loss is 0 for a function that successfully separates the data
  - The maximum margin hyperplane is given by the smallest weight vector that achieves 0 hinge loss
● The hinge loss is not differentiable at z = 1; can't compute the gradient!
  - Subgradient: something that resembles a gradient
  - Use 0 at z = 1
  - In fact, the loss is 0 for z >= 1, so we can focus on z < 1 and proceed as usual
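A minimal sketch of stochastic gradient descent with the hinge-loss subgradient for a linear classifier. The toy data, learning rate, and epoch count are made up, and the labels are taken as -1/+1 so that z = y * (w . x + b).

# Stochastic (sub)gradient descent for a linear classifier with hinge loss.
# Toy data with labels in {-1, +1}; z = y * (w . x + b), loss = max(0, 1 - z).
data = [((2.0, 1.0), 1), ((1.5, 2.0), 1), ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]
w, b, rate = [0.0, 0.0], 0.0, 0.1

for _ in range(100):
    for x, y in data:
        z = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if z < 1:  # the loss is 0 for z >= 1, so only update where the hinge is active
            # Subgradient of max(0, 1 - z) w.r.t. w is -y * x (and -y for b)
            w = [wi + rate * y * xi for wi, xi in zip(w, x)]
            b += rate * y
        # at z == 1 the subgradient 0 is used, i.e. no update

print(w, b)

A full linear SVM objective would also include a regularization term on the weight vector; the sketch follows only the hinge-loss part discussed on these slides.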
Instance-based learning
● Practical problems of the 1-NN scheme:
  - Slow (but: fast tree-based approaches exist)
    Remedy: remove irrelevant data
  - Noise (but: k-NN copes quite well with noise)
    Remedy: remove noisy instances
  - All attributes deemed equally important
    Remedy: weight attributes (or simply select)
  - Doesn't perform explicit generalization
    Remedy: rule-based NN approach

Learning prototypes
● Only those instances involved in a decision need to be stored
● Noisy instances should be filtered out
● Idea: only use prototypical examples

Speed up, combat noise
● IB2: save memory, speed up classification
  - Work incrementally
  - Only incorporate misclassified instances
  - Problem: noisy data gets incorporated
● IB3: deal with noise
  - Discard instances that don't perform well
  - Compute confidence intervals for
    1. each instance's success rate
    2. the default accuracy of its class
  - Accept/reject instances:
    - Accept if the lower limit of 1 exceeds the upper limit of 2
    - Reject if the upper limit of 1 is below the lower limit of 2

Weight attributes
● IB4: weight each attribute (weights can be class-specific)
● Weighted Euclidean distance:
  w_1^2 (x_1 - y_1)^2 + ... + w_n^2 (x_n - y_n)^2
● Update the weights based on the nearest neighbor:
  - Class correct: increase weight
  - Class incorrect: decrease weight
  - The amount of change for the i-th attribute depends on |x_i - y_i|
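A minimal sketch of the weighted distance above together with a crude IB4-style weight update. The update amounts, the assumption that attributes are scaled to [0, 1], and the data are invented for illustration; the actual IB4 scheme is more involved than this.

# Weighted Euclidean distance and a crude IB4-style attribute-weight update.
def weighted_distance(a, b, weights):
    """w_1^2 (a_1 - b_1)^2 + ... + w_n^2 (a_n - b_n)^2 (squared form, as on the slide)."""
    return sum((w * (x - y)) ** 2 for w, x, y in zip(weights, a, b))

def update_weights(weights, x, neighbor, same_class, step=0.05):
    """Nudge each weight up if the neighbor's class was correct, down otherwise.
    The size of the change for attribute i depends on |x_i - neighbor_i|."""
    new = []
    for w, xi, ni in zip(weights, x, neighbor):
        delta = step * (1.0 - abs(xi - ni))   # similar values get a bigger nudge
        new.append(w + delta if same_class else max(0.0, w - delta))
    return new

weights = [1.0, 1.0, 1.0]
x, neighbor = (0.2, 0.9, 0.5), (0.25, 0.1, 0.5)
print(weighted_distance(x, neighbor, weights))
print(update_weights(weights, x, neighbor, same_class=True))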
Rectangular generalizations
● The nearest-neighbor rule is used outside the rectangles
● Rectangles are rules! (But they can be more conservative than "normal" rules.)
● Nested rectangles are rules with exceptions
Generalized exemplars
● Generalize instances into hyperrectangles
  - Online: incrementally modify the rectangles
  - Offline version: seek a small set of rectangles that cover the instances
● Important design decisions:
  - Allow overlapping rectangles? (Requires conflict resolution)
  - Allow nested rectangles?
  - Dealing with uncovered instances?

Separating generalized exemplars
● (Figure: instances of Class 1 and Class 2 generalized into rectangles, with a separation line between them)
Generalized distance functions
● Given: some transformation operations on attributes
● K*: similarity = probability of transforming instance A into B by chance
  - Average over all transformation paths
  - Weight paths according to their probability (need a way of measuring this)
● Uniform way of dealing with different attribute types
● Easily generalized to give a distance between sets of instances

Numeric prediction
● Counterparts exist for all the schemes previously discussed
  - Decision trees, rule learners, SVMs, etc.
● (Almost) all classification schemes can be applied to regression problems using discretization
  - Discretize the class into intervals
  - Predict the weighted average of the interval midpoints
  - Weight according to the class probabilities
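A minimal sketch of the discretization idea above: a classifier's class probabilities over intervals are turned into a numeric prediction as a weighted average of the interval midpoints. The intervals and probabilities are made-up stand-ins for the output of a real classifier trained on the discretized class.

# Numeric prediction via discretization: weighted average of interval midpoints.
intervals = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]
class_probs = [0.2, 0.5, 0.3]   # P(class = interval_i | x) for some instance x

midpoints = [(lo + hi) / 2.0 for lo, hi in intervals]
prediction = sum(p * m for p, m in zip(class_probs, midpoints))
print(prediction)   # 0.2*5 + 0.5*15 + 0.3*25 = 16.0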