Classification and Bayesian Learning

Supervisor: Prof. Dr. Mohamed Batouche
Presented by: Abdu Hassan AL-Gomai
Contents

- Classification vs. Prediction
- Classification: A Two-Step Process
- Supervised vs. Unsupervised Learning
- Major Classification Models
- Evaluating Classification Methods
- Bayesian Classification
Classification vs. Prediction

What is the difference between classification and prediction?

- A decision tree is a classification model built from existing data. If you apply it to new data, for which the class is unknown, you also get a prediction of the class. [From http://www.kdnuggets.com/faq/classification-vs-prediction.html]
- Classification constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.

Typical applications:
- Text classification
- Target marketing
- Medical diagnosis
- Treatment effectiveness analysis
Classification: A Two-Step Process

1. Model construction: describing a set of predetermined classes.
   - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
   - The set of tuples used for model construction is the training set.
   - The model is represented as classification rules, decision trees, or mathematical formulae.
2. Model usage: classifying future or unknown objects (a code sketch of both steps follows below).
   - Estimate the accuracy of the model:
     - The known label of each test sample is compared with the class the model assigns to it.
     - The accuracy rate is the percentage of test-set samples correctly classified by the model.
     - The test set is independent of the training set.
   - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
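A minimal sketch of the two steps, assuming scikit-learn and its bundled iris dataset (neither is prescribed by the slides; any classifier and labeled dataset would do):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage. First estimate accuracy on an independent test set,
# then classify unseen tuples if the accuracy is acceptable.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
if accuracy >= 0.9:
    new_tuple = [[5.0, 3.4, 1.5, 0.2]]     # class label unknown
    print("predicted class:", model.predict(new_tuple)[0])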
Classification Process (1): Model Construction

The training data are fed to a classification algorithm, which outputs a classifier (model).

Training data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned classifier (model):

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction

The classifier is first checked against testing data, then applied to unseen data.

Testing data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
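The learned rule can be written as a small function and applied to the unseen tuple; this is only a sketch of the slide's rule, not part of the original deck:

# The rule from the model-construction step as a plain Python function.
def tenured(rank: str, years: int) -> str:
    # IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Unseen data from the slide: (Jeff, Professor, 4)
print(tenured("Professor", 4))   # -> "yes" (the rank clause fires)

On the testing data above, the rule classifies three of the four samples correctly (it mislabels Merlisa), an accuracy of 75%.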
Supervised vs. Unsupervised Learning

- Supervised learning (classification):
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations (a teacher presents input-output pairs).
  - New data is classified based on the training set.
- Unsupervised learning (clustering):
  - The class labels of the training data are unknown.
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
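A compact contrast, assuming scikit-learn (not part of the slides): the same points are handled once with class labels (supervised) and once without (unsupervised):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
y = np.array([0, 0, 1, 1])               # labels available: supervised

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[1.1, 1.0]]))         # classified via the given labels

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # no labels
print(km.labels_)                        # clusters discovered from the data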
Major Classification Models

- Bayesian classification
- Decision tree induction
- Neural networks
- Support Vector Machines (SVM)
- Classification based on associations
- Other classification methods:
  - k-nearest neighbours (KNN)
  - Boosting
  - Bagging
  - …
Evaluating Classification Methods

- Predictive accuracy.
- Speed and scalability:
  - time to construct the model;
  - time to use the model.
- Robustness:
  - handling noise and missing values.
- Scalability:
  - efficiency with respect to large data.
- Interpretability:
  - understanding and insight provided by the model.
- Goodness of rules:
  - compactness of classification rules.
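Predictive accuracy, the first criterion, is simple to compute; a minimal sketch with hypothetical labels (no library assumed):

# Predictive accuracy = correctly classified test samples / all test samples.
# The labels here are invented, for illustration only.
true_labels      = ["yes", "no", "yes", "yes", "no"]
predicted_labels = ["yes", "no", "no",  "yes", "no"]

correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)
print(f"accuracy = {accuracy:.0%}")   # 4 of 5 correct -> 80%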
Bayesian Classification

Here we learn Bayesian classification: e.g., how to decide whether a patient is ill or healthy, based on
- a probabilistic model of the observed data, and
- prior knowledge.
Classification Problem

- Training data: examples of the form (d, h(d)),
  - where d is a data object to classify (the input),
  - and h(d) ∈ {1, …, K} is the correct class label for d.
- Goal: given d_new, provide h(d_new).
Why Bayesian?

- Provides practical learning algorithms, e.g. Naïve Bayes.
- Prior knowledge and observed data can be combined.
- It is a generative (model-based) approach, which offers a useful conceptual framework: any kind of object (e.g. sequences) can be classified, based on a probabilistic model specification.
Bayes’ Rule

$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{P(d)}$$

where $d$ = data and $h$ = hypothesis (model). Rearranging shows that both sides express the same joint probability:

$$P(h \mid d)\,P(d) = P(d \mid h)\,P(h) = P(d, h)$$

Who is who in Bayes’ rule:
- $P(h)$: prior belief (probability of hypothesis $h$ before seeing any data).
- $P(d \mid h)$: likelihood (probability of the data if the hypothesis $h$ is true).
- $P(d) = \sum_h P(d \mid h)\,P(h)$: data evidence (marginal probability of the data).
- $P(h \mid d)$: posterior (probability of hypothesis $h$ after having seen the data $d$).
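A worked numeric sketch of the rule on the ill-vs-healthy question from the earlier slide; all figures below are hypothetical, chosen only to exercise the formula:

# Bayes' rule on a hypothetical diagnostic test (all numbers invented).
p_ill         = 0.01    # P(h): prior probability of being ill
p_pos_ill     = 0.95    # P(d|h): probability of a positive test if ill
p_pos_healthy = 0.05    # P(d|not h): false-positive rate

# P(d) = sum over hypotheses of P(d|h) P(h)   (the evidence)
p_pos = p_pos_ill * p_ill + p_pos_healthy * (1 - p_ill)

# P(h|d): posterior probability of being ill given a positive test
p_ill_given_pos = p_pos_ill * p_ill / p_pos
print(f"P(ill | positive) = {p_ill_given_pos:.3f}")   # ~0.161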
Naïve Bayes Classifier

- What can we do if our data $d$ has several attributes?
- Naïve Bayes assumption: the attributes that describe data instances are conditionally independent given the classification hypothesis:

$$P(d \mid h) = P(a_1, \ldots, a_T \mid h) = \prod_t P(a_t \mid h)$$

  - This is a simplifying assumption, and it may obviously be violated in reality.
  - In spite of that, it works well in practice.
- The Bayesian classifier that uses the Naïve Bayes assumption and picks the hypothesis with the maximum posterior probability is called the Naïve Bayes classifier.
  - One of the most practical learning methods.
  - Successful applications: medical diagnosis; text classification.
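A minimal sketch with scikit-learn (an assumed library; the slides name none). CategoricalNB handles discrete attributes like those in the examples that follow, but expects them integer-encoded first:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

X_raw = np.array([["sunny", "hot"], ["sunny", "cool"],
                  ["rainy", "cool"], ["rainy", "hot"]])
y = np.array(["no", "yes", "yes", "no"])

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)        # categorical values -> integer codes

nb = CategoricalNB().fit(X, y)      # estimates P(h) and each P(a_t | h)
print(nb.predict(enc.transform([["sunny", "cool"]])))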
Naïve Bayesian Classifier: Example 1

The evidence E relates all attributes, without exceptions. The new instance to classify:

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Probability of class “yes”:

$$\Pr[yes \mid E] = \frac{\Pr[Outlook = Sunny \mid yes] \times \Pr[Temperature = Cool \mid yes] \times \Pr[Humidity = High \mid yes] \times \Pr[Windy = True \mid yes] \times \Pr[yes]}{\Pr[E]} = \frac{\frac{2}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14}}{\Pr[E]}$$
Counts and relative frequencies from the weather data (counts on the left, relative frequencies on the right):

Outlook      Yes  No  |  Yes   No
Sunny         2    3  |  2/9   3/5
Overcast      4    0  |  4/9   0/5
Rainy         3    2  |  3/9   2/5

Temperature  Yes  No  |  Yes   No
Hot           2    2  |  2/9   2/5
Mild          4    2  |  4/9   2/5
Cool          3    1  |  3/9   1/5

Humidity     Yes  No  |  Yes   No
High          3    4  |  3/9   4/5
Normal        6    1  |  6/9   1/5

Windy        Yes  No  |  Yes   No
False         6    2  |  6/9   2/5
True          3    3  |  3/9   3/5

Play         Yes  No  |  Yes   No
(all days)    9    5  |  9/14  5/14

The underlying weather dataset:

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
Compute Prediction for a New Day

Using the relative frequencies in the table above, compute the prediction for a new day:

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Likelihood of the two classes:
  For “yes”: 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
  For “no”:  3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into probabilities by normalization:
  P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
  P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795
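The same computation as a small, library-free Python sketch (exact fractions avoid rounding until the end):

from fractions import Fraction as F

# Relative frequencies from the weather table for the new day
# (Sunny, Cool, High, True).
likelihood_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
likelihood_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

# Normalize so the two class probabilities sum to 1.
total = likelihood_yes + likelihood_no
print(f"P(yes) = {float(likelihood_yes / total):.3f}")   # ~0.205
print(f"P(no)  = {float(likelihood_no / total):.3f}")    # ~0.795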
Naïve Bayesian Classifier: Example 2

Classes:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Training dataset:

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Naïve Bayesian Classifier: Example 2 (continued)

Compute P(X|Ci) for each class:
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

Multiply by the class priors P(C1) = 9/14 and P(C2) = 5/14:
  P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
  P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007

Since 0.028 > 0.007, X belongs to class buys_computer = “yes”.
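A minimal sketch that reproduces this computation directly from the training table (pure Python, no library assumed):

# Naive Bayes by hand on the buys_computer table; columns are
# (age, income, student, credit_rating, class). Counting reproduces
# the conditional probabilities listed on the slide.
data = [
    ("<=30","high","no","fair","no"),    ("<=30","high","no","excellent","no"),
    ("31…40","high","no","fair","yes"),  (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"),    (">40","low","yes","excellent","no"),
    ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"),   (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"),
    ("31…40","medium","no","excellent","yes"),
    ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
x = ("<=30", "medium", "yes", "fair")    # the sample X to classify

scores = {}
for c in ("yes", "no"):
    rows = [r for r in data if r[-1] == c]
    score = len(rows) / len(data)        # prior P(c)
    for i, value in enumerate(x):        # times each P(a_i = value | c)
        score *= sum(r[i] == value for r in rows) / len(rows)
    scores[c] = score

print(scores)                            # {'yes': ~0.028, 'no': ~0.007}
print("prediction:", max(scores, key=scores.get))   # -> 'yes'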
Naïve Bayesian Classifier: Advantages and Disadvantages

- Advantages:
  - Easy to implement.
  - Good results obtained in most of the cases.
- Disadvantages:
  - The assumption of class-conditional independence causes a loss of accuracy.
  - Practically, dependencies exist among variables. E.g., in hospitals, patients have a profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.).
  - Dependencies among these cannot be modeled by the Naïve Bayesian classifier.
- How to deal with these dependencies? Bayesian belief networks.
References

- Software: Naïve Bayes for classifying text: http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
- Useful reading for those interested to learn more about NB classification, beyond the scope of this module: http://www-2.cs.cmu.edu/~tom/NewChapters.html
- http://www.cs.unc.edu/Courses/comp790-090-s08/Lecturenotes
- Introduction to Bayesian Learning, School of Computer Science, University of Birmingham, [email protected].