CLASSIFICATION AND PREDICTION
Part 1:
Decision Trees and Bayesian Classification
Slide 1
Lecture notes for course 401 192 / 1
WS 2002 VO 2.0
Peter Brezany
Institut für Softwarewissenschaft
Universität Wien
Classification and Prediction
Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
Classification predicts categorical (discrete) labels, whereas prediction models continuous-valued functions.
Slide 2

ID   Age   Car type   Risk class
---------------------------------
1    23    family     high
2    17    sports     high
3    43    sports     high
4    68    family     low
5    32    truck      low

Age is a numeric attribute, car type a categorical attribute, and risk class the class attribute.
Classification of insurance customers into the risk classes "high" and "low".
Categorical (nominal) attributes have a finite number of possible values, with no ordering
among the values (e.g., occupation, color, etc.).
Classification and Prediction (2)
Another example: a classification model may be built to categorize bank loan
applications as either safe or risky.
Slide 3
A prediction model may be built to predict the expenditures of potential customers on
computer equipment given their income and occupation.
Most algorithms proposed so far are memory resident, typically assuming a small data size.
Recent database mining research has built on such work, developing scalable classification
and prediction techniques capable of handling large disk-resident data. These techniques
often consider parallel and distributed processing.
The Data Classification Process
Data classification is a two-step process (Fig. 1). In the first step, a model is built
describing a predetermined set of data classes or concepts.
The model is constructed by analyzing database tuples described by attributes. Each tuple is
assumed to belong to a predefined class, as determined by one of the attributes, called the
class label attribute. In the context of classification, data tuples are also referred to as
samples, examples, or objects.
Slide 4
The data tuples analyzed to build the model collectively form the training data set. The
individual tuples making up the training set are referred to as training samples and are
randomly selected from the same population.
Since the class label of each training sample is provided, this step is also known as
supervised learning.
In unsupervised learning (or clustering) the class label of each training sample is not
known, and the number or set of classes to be learned may not be known in advance. We
will learn more about clustering later.
The learned model is represented in the form of classification rules, decision trees, or
mathematical formulae.
The Data Classification Process (2)
In Fig. 1(a), classification rules can be learned to identify customers as having excellent or
fair credit ratings. The rules can be used to categorize future data samples. In the second
step Fig. 1(b), the model is used for classification. The predictive accuracy of the model
(classifier) is estimated.
Slide 5
A simple technique is applied that uses a test set of class-labeled samples. These are
independent from the training samples. The accuracy of a model on a given test set is the
percentage of test set samples that are correctly classified by the model; for each sample, the
known class label is compared with the learned model’s class prediction for that sample.
If the accuracy were estimated based on the training set, the estimate could be overly optimistic, since the learned model tends to overfit the training data.
If the accuracy of the model is considered acceptable, the model can be used to classify
future data tuples or objects for which the class label is not known. In our example, the
classification rules learned can be used to predict the credit rating of new or future
customers.
The Data Classification Process (3)
Slide 6

(a) Learning: the training data are analyzed by a classification algorithm to derive
classification rules.

Training data:
name           age       income   credit_rating
------------------------------------------------
Courtney Fox   31...40   high     excellent
Sandy Jones    <=30      low      fair
Bill Lee       <=30      low      excellent
Susan Lake     >40       med      fair
Claire Phips   >40       med      fair
Andre Beau     31...40   high     excellent
...            ...       ...      ...

Learned classification rule (example):
if age = "31...40" and income = high then credit_rating = excellent

(b) Classification: test data are used to estimate the accuracy of the rules. If the accuracy
is acceptable, the rules can be applied to the classification of new data.

Test data:
name           age       income   credit_rating
------------------------------------------------
Frank Jones    >40       high     fair
Sylvia Crest   <=30      low      fair
Anne Yee       31...40   high     excellent
...            ...       ...      ...

New data: (John Henri, 31...40, high)  ->  Credit rating? excellent

Figure 1: The data classification process.
Comparing Classification Methods
Predictive accuracy: the ability of the model to correctly predict the class label of new
or previously unseen data.
Slide 7
Speed: the computation costs involved in generating and using the model.
Robustness: the ability of the model to make correct predictions given noisy data or
data with missing values.
Scalability: the ability to construct the model efficiently given large amounts of data.
Interpretability: the level of understanding and insight that is provided by the model.
Classification by Decision Tree Induction
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on
an attribute, each branch represents an outcome of the test, and leaf nodes represent classes.
Slide 8

Decision tree for the insurance-risk example of Slide 2:

Car type?
  = truck:   risk class = low
  != truck:  Age?
               > 60:  risk class = low
               <= 60: risk class = high
Classification by Decision Tree Induction (2)
Another typical decision tree is shown in Fig. 2. It represents the concept buys_computer - it
predicts whether or not a customer at AllElectronics is likely to purchase a computer.
Slide 9
In order to classify an unknown sample, the attribute values of the sample are tested against
the decision tree. A path is traced from the root to a leaf node that holds the class prediction
for that sample.
Decision trees can easily be converted to classification rules.
Decision trees have been used in many application areas: medicine, business, game theory,
etc.
When a decision tree is built, many of the branches may reflect noise or outliers in the
training data. Tree pruning attempts to identify and remove such branches - improving
classification accuracy on unseen data.
An Example Decision Tree
Slide 10

age?
  <=30:    student?
             no:  no
             yes: yes
  31...40: yes
  >40:     credit_rating?
             excellent: no
             fair:      yes

Figure 2: A decision tree for the concept buys_computer, indicating whether or not a customer
at AllElectronics is likely to purchase a computer. Each internal (nonleaf) node represents a
test on an attribute. Each leaf node represents a class (either buys_computer = yes or
buys_computer = no).
Another Example - The Contact Lens Problem
Look at the contact lens data in the table split across the next two slides: this table gives the conditions under
which an optician might want to prescribe soft contact lenses, hard contact lenses, or none.
Slides 11-12

--------------------------------------------------------------------------
age             spectacle       astigmatism   tear production   recommended
                prescription                  rate              lenses
--------------------------------------------------------------------------
young           myope           no            reduced           none
young           myope           no            normal            soft
young           myope           yes           reduced           none
young           myope           yes           normal            hard
young           hypermetrope    no            reduced           none
young           hypermetrope    no            normal            soft
young           hypermetrope    yes           reduced           none
young           hypermetrope    yes           normal            hard
pre-presbyopic  myope           no            reduced           none
pre-presbyopic  myope           no            normal            soft
pre-presbyopic  myope           yes           reduced           none
pre-presbyopic  myope           yes           normal            hard
pre-presbyopic  hypermetrope    no            reduced           none
pre-presbyopic  hypermetrope    no            normal            soft
pre-presbyopic  hypermetrope    yes           reduced           none
pre-presbyopic  hypermetrope    yes           normal            none
presbyopic      myope           no            reduced           none
presbyopic      myope           no            normal            none
presbyopic      myope           yes           reduced           none
presbyopic      myope           yes           normal            hard
presbyopic      hypermetrope    no            reduced           none
presbyopic      hypermetrope    no            normal            soft
presbyopic      hypermetrope    yes           reduced           none
presbyopic      hypermetrope    yes           normal            none
--------------------------------------------------------------------------
Part of a structural description of the information in the above table might be as follows:

If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatism = no
           then recommendation = soft
Decision Tree for the Contact Lens Data
Slide 13

tear production rate?
  reduced: none
  normal:  astigmatism?
             no:  soft
             yes: spectacle prescription?
                    myope:        hard
                    hypermetrope: none
The Third Example - The Weather Problem
Slide 14

-------------------------------------------------------------------
outlook     temperature   humidity   windy   play
-------------------------------------------------------------------
sunny       hot           high       false   no
sunny       hot           high       true    no
overcast    hot           high       false   yes
rainy       mild          high       false   yes
rainy       cool          normal     false   yes
rainy       cool          normal     true    no
overcast    cool          normal     true    yes
sunny       mild          high       false   no
sunny       cool          normal     false   yes
rainy       mild          normal     false   yes
sunny       mild          normal     true    yes
overcast    mild          high       true    yes
overcast    hot           normal     false   yes
rainy       mild          high       true    no
-------------------------------------------------------------------
The Weather Problem – Rules
A set of rules learned from the information introduced in the previous
slide - not necessarily a very good set - might look like this:
Slide 15
------------------------------------------------------
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true    then play = no
If outlook = overcast                  then play = yes
If humidity = normal                   then play = yes
If none of the above                   then play = yes
------------------------------------------------------

The above rules are meant to be interpreted in order: the first one first, then,
if it doesn't apply, the second, and so on.
The Weather Data With Some Numeric Attributes
Slide 16
-------------------------------------------------------------------
outlook     temperature   humidity   windy   play
-------------------------------------------------------------------
sunny       85            85         false   no
sunny       80            90         true    no
overcast    83            86         false   yes
rainy       70            96         false   yes
rainy       68            80         false   yes
rainy       65            70         true    no
overcast    64            65         true    yes
sunny       72            95         false   no
sunny       69            70         false   yes
rainy       75            80         false   yes
sunny       75            70         true    yes
overcast    72            90         true    yes
overcast    81            75         false   yes
rainy       71            91         true    no
-------------------------------------------------------------------

Now the first rule from the last slide might take the form:

If outlook = sunny and humidity > 83 then play = no
The Weather Data and Association Rules
Many association rules can be derived from the weather data. Some good
ones are
Slide 17

If temperature = cool                  then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no       then humidity = high
If windy = false and play = no         then outlook = sunny
                                            and humidity = high
Divide and Conquer: Constructing decision trees
This problem can be expressed recursively:
Select an attribute to place at the root node
Slide 18
Make a branch for each possible value. This splits up the example set into subsets, one
for every value of the attribute.
Now the process can be repeated recursively for each branch, using only those
instances that actually reach the branch.
If at any time all instances at a node have the same classification, stop developing that
part of the tree.
How do we determine which attribute to split on, given a set of examples with different classes?
In our weather data, there are four possibilities for each split - see the next slide.
Tree Stumps for the Weather Data
Slide 19

Class counts (yes/no for play) at the leaves of each one-attribute stump:

(a) outlook:      sunny -> [2 yes, 3 no]    overcast -> [4 yes, 0 no]   rainy -> [3 yes, 2 no]
(b) temperature:  hot -> [2 yes, 2 no]      mild -> [4 yes, 2 no]       cool -> [3 yes, 1 no]
(c) humidity:     high -> [3 yes, 4 no]     normal -> [6 yes, 1 no]
(d) windy:        false -> [6 yes, 2 no]    true -> [3 yes, 3 no]
Constructing decision trees (2)
If we had a measure of the purity of each node, we could choose the attribute that produces
the purest daughter nodes.
Slide 20
The measure of purity is called the information and is measured in units called bits.
Associated with a node of the tree, it represents the expected amount of information that
would be needed to specify whether a new instance should be classified yes or no, given that
the example reached that node.
The information is calculated based on the number of yes or no classes at each node; we will
look at the details of the calculation shortly.
When evaluating the first tree in the figure on the last slide, the numbers of yes and no classes
at the leaf nodes are [2,3], [4,0], and [3,2], and the respective information values
(entropies) are

info([2,3]) = entropy(2/5, 3/5) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971 bits
info([4,0]) = 0.0 bits
info([3,2]) = 0.971 bits
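These information values are easy to verify programmatically. Below is a minimal Python sketch (the helper name info is ours, not taken from any library) that reproduces the numbers above:

import math

def info(class_counts):
    """Expected information (entropy, in bits) of a node with the
    given list of class counts, e.g. [2, 3] for 2 yes and 3 no."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:                      # 0 * log(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(info([2, 3]), 3))   # 0.971
print(round(info([4, 0]), 3))   # 0.0
print(round(info([9, 5]), 3))   # 0.94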
Constructing decision trees (3)
We calculate the average information value of these, taking into account the number of
instances that go down each branch
info([2,3], [4,0],[3,2]) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.693 bits
Slide 21
The above average represents the amount of information that we expect would be necessary
to specify the class of a new instance, given the tree structure (a) in our example.
Before computing any of the tree structures, the training examples at the root comprised nine
yes and five no instances, corresponding to an information value of
info([9,5]) = 0.940 bits
Thus the tree (a) is responsible for an information gain of
gain(outlook) = info([9,5]) - info([2,3], [4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits,
which can be interpreted as the information value of creating a branch on the outlook
attribute.
Constructing decision trees (4)
The way forward is clear. We calculate the information for each attribute and choose the one
that gains the most information to split on. In the situation of our figure,
Slide 22
gain(outlook) = 0.247 bits
gain(temperature) = 0.029 bits
gain(humidity) = 0.152 bits
gain(windy) = 0.048 bits
so we select outlook as the splitting attribute at the root of the tree.
Then we continue recursively. The next figure shows the possibilities for a further branch at the
node reached when outlook is sunny. The information gain for each turns out to be
gain(temperature) = 0.571 bits
gain(humidity) = 0.971 bits
gain(windy) = 0.020 bits,
so we select humidity as the splitting attribute at this point.
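As a cross-check of these gain values, the following sketch recomputes them from the weather table of Slide 14. It is illustrative code only: the info and gain helpers and the DATA list are our own names, and the class label is assumed to be the last field of each tuple.

import math
from collections import Counter, defaultdict

# Weather data from Slide 14: (outlook, temperature, humidity, windy, play)
DATA = [
    ("sunny", "hot", "high", "false", "no"),     ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),  ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
ATTRS = ["outlook", "temperature", "humidity", "windy"]   # index = position in each tuple

def info(counts):
    """Entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(rows, attr_index):
    """Information gain of splitting `rows` on the attribute at `attr_index`."""
    before = info(list(Counter(r[-1] for r in rows).values()))
    partitions = defaultdict(list)
    for r in rows:
        partitions[r[attr_index]].append(r[-1])
    after = sum(len(part) / len(rows) * info(list(Counter(part).values()))
                for part in partitions.values())
    return before - after

for i, name in enumerate(ATTRS):
    print(f"gain({name}) = {gain(DATA, i):.3f}")
# gain(outlook) = 0.247, gain(temperature) = 0.029,
# gain(humidity) = 0.152, gain(windy) = 0.048

# At the node reached when outlook = sunny:
sunny = [r for r in DATA if r[0] == "sunny"]
for i, name in enumerate(ATTRS[1:], start=1):
    print(f"gain({name} | outlook=sunny) = {gain(sunny, i):.3f}")
# 0.571, 0.971, 0.020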
Extended Tree Stumps for the Weather Data
Slide 23

Expanded stumps grown under the branch outlook = sunny (class counts for play):

temperature:  hot -> [0 yes, 2 no]    mild -> [1 yes, 1 no]    cool -> [1 yes, 0 no]
humidity:     high -> [0 yes, 3 no]   normal -> [2 yes, 0 no]
windy:        false -> [1 yes, 2 no]  true -> [1 yes, 1 no]
Decision Tree for the Weather Data
Slide 24

outlook?
  sunny:    humidity?
              high:   no
              normal: yes
  overcast: yes
  rainy:    windy?
              false: yes
              true:  no
Calculating Information
The measure should be applicable to multiclass situations, not just to two-class ones.
For example, in a 3-class situation:

Slide 25

info([2, 3, 4]) = entropy(2/9, 3/9, 4/9)
                = -(2/9) log2(2/9) - (3/9) log2(3/9) - (4/9) log2(4/9)
                = 1.530 bits

The logarithms are expressed in base 2.

So far, we have addressed the decision tree topic in an informal way. A more formal
approach follows.
Decision Tree Construction - The Basic Algorithm
The basic algorithm for decision tree induction constructs decision trees in a top-down,
recursive, divide-and-conquer manner.
Slide 26
Algorithm: Generate_decision_tree. Generate a decision tree from the given training data.
Input: The training samples, samples, represented by discrete-valued attributes; the set of
candidate attributes, attribute-list.
Output: A decision tree.
Method:
(1)  create a node N;
(2)  if samples are all of the same class, C, then
(3)      return N as a leaf node labeled with the class C;
(4)  if attribute-list is empty then
(5)      return N as a leaf node labeled with the most common class in samples;
         // majority voting
(6)  select test-attribute, the attribute among attribute-list with the highest
     information gain;
Decision Tree Construction - Basic Algorithm (2)
Slide 27
(7)  label node N with test-attribute;
(8)  for each known value a_i of test-attribute  // partition the samples
(9)      grow a branch from node N for the condition test-attribute = a_i;
(10)     let s_i be the set of samples in samples for which test-attribute = a_i;
         // a partition
(11)     if s_i is empty then
(12)         attach a leaf labeled with the most common class in samples;
(13)     else attach the node returned by
             Generate_decision_tree(s_i, attribute-list - test-attribute);
The basic strategy of the algorithm is informally explained in the next slides.
Decision Tree Construction - Basic Algorithm (3)
1. The tree starts as a single node representing the training samples (step 1).
2. If the samples are all of the same class, then the node becomes a leaf and is labeled
with that class (steps 2 and 3).
Slide 28
3. Otherwise, the algorithm uses an entropy-based measure known as information gain
as a heuristic for selecting the attribute that will best separate the samples into
individual classes (step 6). This attribute becomes the “test” or “decision” attribute at
the node (step 7). In this version of the algorithm, all attributes are categorical, that is,
discrete-valued. Continuous-valued attributes must be discretized.
4. A branch is created for each known value of the test attribute, and the samples are
partitioned accordingly (steps 8-10).
5. The algorithm uses the same process recursively to form a decision tree for the
samples at each partition. Once an attribute has occurred at a node, it need not be
considered in any of the node's descendants (step 13).
Decision Tree Construction - Basic Algorithm (4)
6. The recursive partitioning stops only when any one of the following conditions is true:
(a) All samples for a given node belong to the same class (steps 2 and 3), or
Slide 29
(b) There are no remaining attributes on which the samples may be further partitioned
(step 4). In this case, majority voting is employed (step 5). This involves
converting the given node into a leaf and labeling it with the class in majority
among samples. Alternatively, the class distribution of the node samples may be
stored.
(c) There are no samples for the branch test-attribute = a_i (step 11). In this case, a
leaf is created with the majority class in samples (step 12).
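The pseudocode above translates almost line by line into a small Python sketch. This is a toy illustration under our own conventions (samples as dicts with a "class" key, the tree as a nested dict), not the author's implementation; the information-gain helper mirrors the measure defined formally in the next section.

import math
from collections import Counter, defaultdict

def info(counts):
    """Entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(samples, attr):
    """Information gain of splitting `samples` (dicts with a 'class' key) on `attr`."""
    before = info(list(Counter(s["class"] for s in samples).values()))
    parts = defaultdict(list)
    for s in samples:
        parts[s[attr]].append(s["class"])
    after = sum(len(p) / len(samples) * info(list(Counter(p).values()))
                for p in parts.values())
    return before - after

def generate_decision_tree(samples, attribute_list):
    classes = [s["class"] for s in samples]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:                     # steps 2-3: all samples in one class
        return classes[0]
    if not attribute_list:                         # steps 4-5: no attributes left -> majority vote
        return majority
    test_attr = max(attribute_list, key=lambda a: gain(samples, a))   # step 6
    node = {test_attr: {}}                         # step 7: label the node with the test attribute
    partitions = defaultdict(list)                 # steps 8-10: partition the samples
    for s in samples:
        partitions[s[test_attr]].append(s)
    remaining = [a for a in attribute_list if a != test_attr]
    for value, subset in partitions.items():
        if not subset:                             # steps 11-12 (cannot occur with this partitioning)
            node[test_attr][value] = majority
        else:                                      # step 13: recurse on the partition
            node[test_attr][value] = generate_decision_tree(subset, remaining)
    return node

# Example usage with a few weather-style samples (hypothetical toy data):
samples = [
    {"outlook": "sunny", "windy": "false", "class": "no"},
    {"outlook": "sunny", "windy": "true",  "class": "no"},
    {"outlook": "overcast", "windy": "false", "class": "yes"},
    {"outlook": "rainy", "windy": "false", "class": "yes"},
    {"outlook": "rainy", "windy": "true",  "class": "no"},
]
print(generate_decision_tree(samples, ["outlook", "windy"]))
# {'outlook': {'sunny': 'no', 'overcast': 'yes', 'rainy': {'windy': {'false': 'yes', 'true': 'no'}}}}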
Attribute Selection Measure
The information gain measure is used to select the test attribute at each node in the tree - it
is also called an attribute selection measure or a measure of the goodness of split.
Slide 30
The attribute with the highest information gain (or greatest entropy reduction) is chosen as
the test attribute for the current node.
Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct
values defining m distinct classes, Ci (for i = 1, ..., m). Let si be the number of samples of S
in class Ci. The expected information needed to classify a given sample is given by

I(s1, s2, ..., sm) = - sum_{i=1..m} pi log2(pi)     (Eq. 1)

where pi is the probability that an arbitrary sample belongs to class Ci and is estimated by
si/s.
Attribute Selection Measure (2)
Let attribute A have v distinct values, {a1, a2, ..., av}. Attribute A can be used to partition S
into v subsets, {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj
of A. If A were selected as the test attribute (i.e., the best attribute for splitting), then
these subsets would correspond to the branches grown from the node containing the set S.

Slide 31

Let sij be the number of samples of class Ci in a subset Sj. The entropy, or expected
information based on the partitioning into subsets by A, is given by

E(A) = sum_{j=1..v} ((s1j + ... + smj) / s) * I(s1j, ..., smj)     (Eq. 2)

The term (s1j + ... + smj) / s acts as the weight of the jth subset and is the number of samples
in the subset (i.e., having value aj of A) divided by the total number of samples in S. The
smaller the entropy value, the greater the purity of the subset partitions.
Attribute Selection Measure (3)

Slide 32

For a given subset Sj,

I(s1j, ..., smj) = - sum_{i=1..m} pij log2(pij)     (Eq. 3)

where pij = sij / |Sj| is the probability that a sample in Sj belongs to class Ci.

The encoding information that would be gained by branching on A is

Gain(A) = I(s1, s2, ..., sm) - E(A)     (Eq. 4)

Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A.

The algorithm computes the information gain of each attribute. The attribute with the
highest information gain is chosen as the test attribute for the given set S.
Example on the Induction of a Decision Tree
Table 2 (on a subsequent slide) presents a training set taken from the AllElectronics customer
database. The class label attribute, buys_computer, has two distinct values (yes, no); therefore,
there are two distinct classes (m = 2). Let class C1 correspond to yes and class C2 correspond
to no. There are 9 samples of class yes and 5 samples of class no.

Slide 33

Using Eq. 1, the expected information needed to classify a given sample is

I(s1, s2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Next, we need to compute the entropy of each attribute. Let us start with the attribute age:

For age = "<=30":    s11 = 2, s21 = 3, I(s11, s21) = 0.971
For age = "31...40": s12 = 4, s22 = 0, I(s12, s22) = 0
For age = ">40":     s13 = 3, s23 = 2, I(s13, s23) = 0.971
Example on the Induction of a Decision Tree (2)
The expected information needed to classify a given sample if the samples are partitioned
according to age is

Slide 34

E(age) = (5/14) I(s11, s21) + (4/14) I(s12, s22) + (5/14) I(s13, s23) = 0.694

Hence, the gain in information from such a partitioning would be

Gain(age) = I(s1, s2) - E(age) = 0.940 - 0.694 = 0.246

Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and
Gain(credit_rating) = 0.048.

Since age has the highest information gain among the attributes, it is selected as the test
attribute - see Fig. 3. The final decision tree returned by the algorithm is shown in Fig. 2.
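The arithmetic of this example can be checked with a few lines of Python. The helper I below is our own shorthand for the expected information of Eq. 1; note that the slides round the intermediate values (0.940, 0.971) before subtracting, which is why they report Gain(age) = 0.246 where the unrounded computation gives 0.247.

import math

def I(*counts):
    """Expected information I(s1, ..., sm) of Eq. 1, in bits."""
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

# Class distribution of the 14 training samples: 9 yes, 5 no
print(f"I(9,5)    = {I(9, 5):.3f}")            # 0.940

# Partitioning by age: <=30 -> (2 yes, 3 no), 31...40 -> (4, 0), >40 -> (3, 2)
E_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(f"E(age)    = {E_age:.3f}")              # 0.694
print(f"Gain(age) = {I(9, 5) - E_age:.3f}")    # 0.247 (0.246 on the slides, which round first)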
Training Data Tuples from the AllElectronics DB
Table 2

Slide 35

-----------------------------------------------------------------------
RID   age       income   student   credit_rating   Class: buys_computer
-----------------------------------------------------------------------
1     <=30      high     no        fair            no
2     <=30      high     no        excellent       no
3     31...40   high     no        fair            yes
4     >40       medium   no        fair            yes
5     >40       low      yes       fair            yes
6     >40       low      yes       excellent       no
7     31...40   low      yes       excellent       yes
8     <=30      medium   no        fair            no
9     <=30      low      yes       fair            yes
10    >40       medium   yes       fair            yes
11    <=30      medium   yes       excellent       yes
12    31...40   medium   no        excellent       yes
13    31...40   high     yes       fair            yes
14    >40       medium   no        excellent       no
-----------------------------------------------------------------------
Sample Partitioning

Slide 36

age = "<=30":
  income   student   credit_rating   class
  -----------------------------------------
  high     no        fair            no
  high     no        excellent       no
  medium   no        fair            no
  low      yes       fair            yes
  medium   yes       excellent       yes

age = "31...40":
  income   student   credit_rating   class
  -----------------------------------------
  high     no        fair            yes
  low      yes       excellent       yes
  medium   no        excellent       yes
  high     yes       fair            yes

age = ">40":
  income   student   credit_rating   class
  -----------------------------------------
  medium   no        fair            yes
  low      yes       fair            yes
  low      yes       excellent       no
  medium   yes       fair            yes
  medium   no        excellent       no

Figure 3: The attribute age has the highest information gain and therefore becomes a test
attribute at the root node of the decision tree. Branches are grown for each value of age.
The samples are shown partitioned according to each branch.
Extracting Classification Rules from Decision Trees
The knowledge represented in decision trees can be extracted and represented in the form of
classification IF-THEN rules.
Slide 37
One rule is created for each path from the root to a leaf node.
The IF-THEN rules may be easier for humans to understand, particularly if the given tree is
very large.
IF age = "<=30"    AND student = "no"              THEN buys_computer = "no"
IF age = "<=30"    AND student = "yes"             THEN buys_computer = "yes"
IF age = "31...40"                                 THEN buys_computer = "yes"
IF age = ">40"     AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40"     AND credit_rating = "fair"      THEN buys_computer = "yes"
BAYESIAN CLASSIFICATION
Bayesian classifiers are statistical classifiers.
They can predict class membership probabilities, such as the probability that a given sample
belongs to a particular class.

Slide 38

A simple Bayesian classifier known as the naive Bayesian classifier is comparable in
performance with decision tree and neural network classifiers; it has also exhibited high
speed and accuracy when applied to large databases.
Naive Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes - class conditional independence.
Bayesian belief networks are graphical models which allow the representation of
dependencies among subsets of attributes.
Bayes Theorem
Let X be a data sample whose class label is unknown, and let H be some hypothesis, such as
that the data sample X belongs to a specified class C. For classification, we want to determine
P(H|X), the probability that the hypothesis H holds given the observed data sample X.

Slide 39

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Example: Suppose the world of data samples consists of fruits, described by their color
and shape; X is red and round; H = "X is an apple". Then P(H|X) reflects our confidence
that X is an apple given that we have seen that X is red and round.

P(H) is the prior probability, or a priori probability, of H.
Example: This is the probability that any given data sample is an apple, regardless of how
the data sample looks.

The posterior probability P(H|X) is based on more information than the prior probability P(H).
Bayes Theorem (2)
P(X|H) is the posterior probability of X conditioned on H.
Example: It is the probability that X is red and round given that we know that it is true that
X is an apple.

Slide 40

P(X) is the prior probability of X.
Example: It is the probability that a data sample from our set of fruits is red and round.

Bayes theorem is

P(H|X) = P(X|H) P(H) / P(X)

In the next part, we will learn how Bayes theorem is used in the naive Bayesian classifier.
Naive Bayesian Classification
The naive Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Each data sample is represented by an n-dimensional feature vector X = (x1, x2, ..., xn),
   depicting n measurements made on the sample from n attributes, respectively A1, A2, ..., An.

Slide 41

2. Suppose that there are m classes, C1, C2, ..., Cm. Given an unknown data sample X
   (i.e., having no class label), the classifier will predict that X belongs to the class
   having the highest posterior probability, conditioned on X. That is, the naive Bayesian
   classifier assigns an unknown sample X to the class Ci if and only if

   P(Ci|X) > P(Cj|X)   for 1 <= j <= m, j != i.

   Thus we maximize P(Ci|X) - the class Ci for which P(Ci|X) is maximized is called the
   maximum posteriori hypothesis. By Bayes theorem,

   P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Naive Bayesian Classification (2)
3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. Note that the
   class prior probabilities may be estimated by P(Ci) = si / s, where si is the number of
   training samples of class Ci and s is the total number of training samples.

4. Given data sets with many attributes, it would be extremely computationally expensive
   to compute P(X|Ci). Therefore, the naive assumption of class conditional independence
   is made. This presumes that there are no dependence relationships among the attributes.
   Thus

Slide 42

   P(X|Ci) = prod_{k=1..n} P(xk|Ci)

   The probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) can be estimated from the training
   samples, where

   (a) If Ak is categorical, then P(xk|Ci) = sik / si, where sik is the number of training
       samples of class Ci having the value xk for Ak, and si is the number of training
       samples belonging to Ci.

   (b) If Ak is continuous-valued, then the attribute is typically assumed to have a
       Gaussian distribution, so that
21
Peter Brezany
Institut für Softwarewissenaschaft, Univ. Wien (2002)
Slide 43

   P(xk|Ci) = g(xk, mu_Ci, sigma_Ci)
            = (1 / (sqrt(2 pi) sigma_Ci)) exp(-(xk - mu_Ci)^2 / (2 sigma_Ci^2))

   where mu_Ci and sigma_Ci are the mean and standard deviation, respectively, of the values
   of attribute Ak for the training samples of class Ci.

5. In order to classify an unknown sample X, P(X|Ci) P(Ci) is evaluated for each class Ci.
   Sample X is then assigned to the class Ci if and only if

   P(X|Ci) P(Ci) > P(X|Cj) P(Cj)   for 1 <= j <= m, j != i.
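For case (b), the Gaussian density can be coded directly. A minimal sketch; the numbers in the example call are made up purely for illustration.

import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) used for continuous attributes."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g. P(age = 35 | Ci) if the ages in class Ci had mean 38 and std. dev. 12 (made-up numbers):
print(gaussian(35, 38, 12))   # ~0.032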
Bayesian Classification – Example
Slide 44
Predicting a class label using naive Bayesian classification: We wish to predict the class
label of an unknown sample using naive Bayesian classification, given the same training
data as in our example for decision tree induction. The data samples are described by the
attributes age, income, student, and credit_rating. The class label attribute, buys_computer,
has two distinct values (namely, yes and no). Let C1 correspond to the class buys_computer
= "yes" and C2 correspond to buys_computer = "no". The unknown sample we wish to
classify is

X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")

We need to maximize P(X|Ci) P(Ci) for i = 1, 2. P(Ci), the prior probability of each class,
can be computed based on the training samples:

P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no")  = 5/14 = 0.357

To compute P(X|Ci) for i = 1, 2, we compute the following conditional probabilities:
Bayesian Classification – Example (2)
Slide 45
P(age = "<=30" | buys_computer = "yes")           = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no")            = 3/5 = 0.600
P(income = "medium" | buys_computer = "yes")      = 4/9 = 0.444
P(income = "medium" | buys_computer = "no")       = 2/5 = 0.400
P(student = "yes" | buys_computer = "yes")        = 6/9 = 0.667
P(student = "yes" | buys_computer = "no")         = 1/5 = 0.200
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no")  = 2/5 = 0.400

Using the above probabilities, we obtain

P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no")  = 0.600 x 0.400 x 0.200 x 0.400 = 0.019

P(X | buys_computer = "yes") P(buys_computer = "yes") = 0.044 x 0.643 = 0.028
P(X | buys_computer = "no")  P(buys_computer = "no")  = 0.019 x 0.357 = 0.007

Therefore, the naive Bayesian classifier predicts buys_computer = "yes" for sample X.
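The whole example can be reproduced with a short script over the 14 tuples of Table 2. This is an illustrative sketch, not a library implementation (in practice one would add a Laplace correction to guard against zero counts, which this example does not need).

from collections import Counter

# (age, income, student, credit_rating, buys_computer) - the rows of Table 2
DATA = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]

X = ("<=30", "medium", "yes", "fair")           # the unknown sample

priors = Counter(row[-1] for row in DATA)       # class counts: yes -> 9, no -> 5
scores = {}
for cls, count in priors.items():
    rows = [r for r in DATA if r[-1] == cls]
    likelihood = 1.0
    for k, value in enumerate(X):               # class-conditional independence assumption
        likelihood *= sum(1 for r in rows if r[k] == value) / len(rows)
    scores[cls] = likelihood * count / len(DATA)    # P(X|Ci) * P(Ci)
    print(cls, round(likelihood, 3), round(scores[cls], 3))
# no 0.019 0.007
# yes 0.044 0.028

print("prediction:", max(scores, key=scores.get))   # prediction: yes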
PREDICTION
Slide 46
What if we would like to predict a continuous value, rather than a categorical label? The
prediction of continuous values can be modeled by the statistical technique of regression.
For example, we may wish to develop a model to predict the salary of college graduates with
10 years of work experience.
Many problems can be solved by linear regression, and even more can be tackled by applying
transformations to the variables so that a nonlinear problem can be converted to a linear one.
Linear Regression
Data are modeled using a straight line.
A random variable Y (called a response variable) is modeled as a linear function of
another random variable X (called a predictor variable):

Slide 47

Y = alpha + beta * X

alpha and beta are regression coefficients. They can be computed by the method of least squares:

beta  = sum_{i=1..s} (xi - x_mean)(yi - y_mean) / sum_{i=1..s} (xi - x_mean)^2
alpha = y_mean - beta * x_mean

where x_mean is the average of x1, ..., xs, y_mean is the average of y1, ..., ys,
and s is the number of samples.
Linear Regression - Example
Slide 48
Salary data
===========================================
X                      Y
Years of experience    Salary (in $1,000)
-------------------------------------------
3                      30
8                      57
9                      64
13                     72
3                      36
6                      43
11                     59
21                     90
1                      20
16                     83
===========================================

A plot of the data is shown in the next figure.
Plot of the data in the previous table
Slide 49

[Scatter plot: years of experience (x-axis, 0-25) vs. salary in $1,000 (y-axis, 0-100).]

Although the points do not fall on a straight line, the overall pattern suggests a linear
relationship between X (years of experience) and Y (salary).
Linear Regression - Example (2)
Using the linear regression formulas introduced before, with x_mean = 9.1 and y_mean = 55.4:

Slide 50

beta  = 1269.6 / 358.9 ~ 3.5
alpha = 55.4 - 3.5 * 9.1 = 23.6

so the equation of the least squares line is Y = 23.6 + 3.5 X.
For an input of 10 years of experience, the predicted salary is 23.6 + 3.5 * 10 = 58.6,
i.e., about $58.6K.

Multiple regression is an extension of linear regression involving more than one predictor
variable:

Y = alpha + beta1 * X1 + beta2 * X2

The method of least squares can also be applied here to solve for alpha, beta1, and beta2.
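The fit can be computed directly from the formulas above; a minimal sketch over the salary data (no regression library needed):

# Salary data from Slide 48: (years of experience, salary in $1,000)
data = [(3, 30), (8, 57), (9, 64), (13, 72), (3, 36),
        (6, 43), (11, 59), (21, 90), (1, 20), (16, 83)]

n = len(data)
x_mean = sum(x for x, _ in data) / n            # 9.1
y_mean = sum(y for _, y in data) / n            # 55.4

beta = (sum((x - x_mean) * (y - y_mean) for x, y in data)
        / sum((x - x_mean) ** 2 for x, _ in data))
alpha = y_mean - beta * x_mean

print(f"Y = {alpha:.1f} + {beta:.1f} X")
# Y = 23.2 + 3.5 X  (the slides give 23.6 because beta is rounded to 3.5 before computing alpha)
print(f"predicted salary for 10 years: {alpha + beta * 10:.1f}")   # 58.6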
Nonlinear Regression
Many relationships between a response variable and predictor variables can be modeled by
polynomial functions (we say that such relationships do not show a linear dependence).

Slide 51

By applying transformations to the variables, we can convert the nonlinear model into a
linear one, which can then be solved by the method of least squares.

For example, consider the cubic polynomial

Y = alpha + beta1 * X + beta2 * X^2 + beta3 * X^3

By defining the new variables

X1 = X,   X2 = X^2,   X3 = X^3

the equation above can be converted to the linear form

Y = alpha + beta1 * X1 + beta2 * X2 + beta3 * X3
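To make the transformation concrete, the sketch below fits the cubic model by constructing the transformed variables X1, X2, X3 and solving the resulting linear least-squares problem with NumPy. The data are synthetic toy values generated from a known cubic, purely for illustration.

import numpy as np

# Toy data generated from a known cubic with a little noise (purely illustrative).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40)
y = 2.0 + 1.0 * x - 0.5 * x**2 + 0.25 * x**3 + rng.normal(0, 0.2, x.size)

# Transform: X1 = X, X2 = X^2, X3 = X^3, plus a column of ones for alpha.
A = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Solve the (now linear) least-squares problem for (alpha, beta1, beta2, beta3).
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
alpha, b1, b2, b3 = coeffs
print(f"alpha={alpha:.2f}, beta1={b1:.2f}, beta2={b2:.2f}, beta3={b3:.2f}")
# Expect values close to 2.0, 1.0, -0.5, 0.25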