Classification
Table of contents:
 Statistical classification
 Classification Processes
 Classification vs. Prediction
 Major Classification Models
 Evaluating Classification Methods
 Classification by decision tree induction
 Bayesian Classification
 Statistical classification
Statistical classification is a procedure in which individual items are placed into groups based
on quantitative information on one or more characteristics inherent in the items (referred to as
traits, variables, characters, etc.) and based on a training set of previously labeled items.
Formally, the problem can be stated as follows: given training data {(x1, y1), …, (xn, yn)},
produce a classifier h: X → Y which maps an object x ∈ X to its classification label y ∈ Y.
For example, if the problem is filtering spam, then xi is some representation of an email and y
is either "Spam" or "Non-Spam".
Statistical classification algorithms are typically used in pattern recognition systems.
 Classification—A Two-Step Process
Classification creates a GLOBAL model that is used for PREDICTING the class
label of unknown data. The predicted class label is a CATEGORICAL attribute.
Classification is clearly useful in many decision problems, where for a given data
item a decision is to be made (which depends on the class to which the data item
belongs).
Typical Applications
 credit approval
 target marketing
 medical diagnosis
 treatment effectiveness analysis
Classification steps:
a) Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute
 The set of tuples used for model construction is the training set
 The model is represented as classification rules, decision trees, or mathematical
formulae
b) Model usage: for classifying future or unknown objects
 The known label of each test sample is compared with the classification result produced
by the model
 Test set is independent of training set
 If the accuracy is acceptable, use the model to classify data tuples whose class
labels are not known
In order to build a global model for classification a training set is needed from
which the model can be derived. There exist many possible models for
classification, which can be expressed as rules, decision trees or mathematical
formulae. Once the model is built, unknown data can be classified. In order to test
the quality of the model, its accuracy can be measured using a test set. If a certain set
of data is available for building a classifier, one normally splits this set into a larger
part, which is the training set, and a smaller part, which is the test set.
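As a minimal illustration of the two-step process, here is a hedged Python sketch (assuming scikit-learn is available; the data set and variable names are illustrative, not taken from the text): a model is constructed on the larger training split and its accuracy is then measured on the smaller, independent test split.

```python
# Minimal sketch of the two-step classification process (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # any labeled data collection would do

# a) Model construction on the larger training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# b) Model usage: compare the known labels of the test set with the model's output
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
# If the accuracy is acceptable, the model is used to classify tuples with unknown labels.
```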
Example:
In classification the classes are known and given by so-called class label attributes.
For the given data collection TENURED would be the class label attribute. The goal
of classification is to determine rules on the other attributes that allow us to predict the
class label attribute, such as the rule shown at the bottom right.
In order to determine the quality of the rules derived from the training set, the test
set is used. We see that the classifier that has been found is correct in 75% of the
cases. If the rules are of sufficient quality, they are used to classify data that has
not been seen before. Since the reliability of the rule has been evaluated as 75% by
testing it against the test set, and assuming that the test set is a representative sample
of all data, the reliability of the rule applied to unseen data should be about the same.
 Major Classification Models
 Classification by decision tree induction
 Bayesian Classification
 Neural Networks
 Support Vector Machines (SVM)
 Classification Based on Associations
 Other Classification Methods
o KNN
o Boosting
o Bagging
o …
 Evaluating Classification Methods
 Predictive accuracy
 Speed
o time to construct the model
o time to use the model
 Robustness
o handling noise and missing values
 Scalability
o efficiency in disk-resident databases
 Goodness of rules
o decision tree size
o compactness of classification rules
 Classification by Decision Tree Induction
 Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on a single attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
 Decision tree generation consists of two phases
– Tree construction
o At start, all the training samples are at the root
o Partition samples recursively based on selected attributes
– Tree pruning
o Identify and remove branches that reflect noise or outliers
 Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
At each node, a decision tree splits the data set into smaller partitions, based on a test
predicate that is applied to one of the attributes in the tuples. Each leaf of the decision tree is
then associated with one specific class label. Generally a decision tree is first constructed in a
top-down manner by recursively splitting the training set using conditions on the attributes.
How these conditions are found is one of the key issues of decision tree induction. After the
tree construction it usually is the case that at the leaf level the granularity is too fine, i.e. many
leaves represent some kind of exceptional data. Thus in a second phase such leaves are
identified and eliminated. Using the decision tree classifier is straightforward: the attribute
values of an unknown sample are tested against the conditions in the tree nodes, and the class
is derived from the class of the leaf node at which the sample arrives.
A standard approach to represent the classification rules is by a decision tree. In a
decision tree at each level one of the existing attributes is used to partition the data
set based on the attribute value. At the leaf level of the classification tree then the
values of the class label attribute are found. Thus, for a given data item with
unknown class label attribute, by traversing the tree from the root to the leaf its class
can be determined. Note that in different branches of the tree, different attributes
may be used for classification. The key problem of finding classification rules is
thus to determine the attributes that are used to partition the data set at each level of
the decision tree.
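To make the traversal concrete, here is a small self-contained Python sketch (the Node structure, attribute names and the tiny hand-built tree are invented for illustration, not taken from the text): internal nodes test a single attribute, branches correspond to test outcomes, and leaves carry class labels.

```python
# Illustrative sketch: classifying an unknown sample by walking a decision tree.
class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute       # attribute tested at this internal node
        self.branches = branches or {}   # outcome value -> child Node
        self.label = label               # class label if this node is a leaf

def classify(node, sample):
    """Follow the branch matching the sample's attribute value until a leaf is reached."""
    while node.label is None:
        node = node.branches[sample[node.attribute]]
    return node.label

# Tiny hand-built tree with a single test on "student" (purely illustrative)
tree = Node(attribute="student", branches={"no": Node(label="no"),
                                           "yes": Node(label="yes")})
print(classify(tree, {"student": "yes"}))   # -> "yes"
```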
Algorithm for Decision Tree Construction
The basic algorithm for decision tree induction proceeds in a greedy manner. First all samples
are at the root. Among the attributes one is chosen to partition the set. The criterion that is
applied to select the attribute is based on measuring the information gain that can be achieved,
or how much uncertainty on the classification of the samples is removed by the partitioning.
• Basic algorithm for categorical attributes (greedy)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training samples are at the root
– Examples are partitioned recursively based on test attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting
is employed for classifying the leaf
– There are no samples left
• Attribute Selection Measure
– Information Gain (ID3/C4.5)
Attribute Selection Measure: Information Gain (ID3/C4.5)
Here we summarize the basic idea of how split attributes are found during the construction of a
decision tree. It is based on an information-theoretic argument. Assuming that we have a
binary category, i.e. two classes P and N into which a data collection S needs to be classified,
we can compute the amount of information required to determine the class, by I(p, n), the
standard entropy measure, where p and n denote the cardinalities of P and N. Given an attribute
A that can be used for partitioning further the data collection in the decision tree, we can
calculate the amount of information needed to classify the data after the split according to
attribute A has been performed. This value is obtained by calculating I(p, n) for each of the
partitions and weighting these values by the probability that a data item belongs to the
respective partition. The information gained by a split then can be determined as the difference
of the amount of information needed for correct classification before and after the split. Thus
we calculate the reduction in uncertainty that is obtained by splitting according to attribute A
and select among all possible attributes the one that leads to the highest reduction. In the
following we illustrate these calculations for our example.
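As a sketch of this calculation (the function names are illustrative; the counts are the 9 "yes" and 5 "no" samples of the weather data used later in Example 1 of the Bayesian section), the split attribute is the one maximizing Gain(A) = I(p, n) - E(A), where E(A) is the weighted sum of I over the partitions induced by A:

```python
# Sketch of the ID3/C4.5 information gain measure (illustrative function names).
from math import log2

def I(p, n):
    """Entropy I(p, n) of a collection with p samples in class P and n in class N."""
    total = p + n
    return sum(-(c / total) * log2(c / total) for c in (p, n) if c)

def gain(p, n, partitions):
    """Information gain of a split; partitions is a list of (p_i, n_i) pairs."""
    expected = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in partitions)
    return I(p, n) - expected

# Weather data: 9 "yes" and 5 "no" samples overall; splitting on Outlook
# yields the partitions Sunny (2, 3), Overcast (4, 0), Rainy (3, 2).
print(I(9, 5))                               # ~0.940 bits before the split
print(gain(9, 5, [(2, 3), (4, 0), (3, 2)]))  # ~0.247 bits gained by splitting on Outlook
```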
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “<=30” AND credit_rating = “fair” THEN buys_computer = “no”
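A hedged sketch of this extraction (the nested-dict tree representation and the hand-built tree below are invented for illustration and only loosely follow the buys_computer example; they are not the document's exact tree):

```python
# Sketch: one IF-THEN rule per root-to-leaf path.
# A tree is represented as nested dicts: {attribute: {value: subtree_or_label}}.
def extract_rules(tree, conditions=()):
    if not isinstance(tree, dict):                    # leaf: tree is the class label
        clause = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {clause} THEN buys_computer = "{tree}"']
    (attribute, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():           # each branch adds one conjunct
        rules += extract_rules(subtree, conditions + ((attribute, value),))
    return rules

# Small hand-built tree, loosely following the buys_computer example (illustrative).
example_tree = {"age": {"<=30": {"student": {"no": "no", "yes": "yes"}},
                        "31…40": "yes",
                        ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}}}}
for rule in extract_rules(example_tree):
    print(rule)
```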
Avoid Overfitting in Classification
 Overfitting: This is when the generated model does not apply to new incoming data.
» Either the training data is too small, not covering many cases
» Or wrong assumptions were made
 Two approaches to avoid overfitting
- Prepruning: Halt tree construction early—do not split a node if this would result in
the goodness measure falling below a threshold
o Difficult to choose an appropriate threshold
- Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
o Use a set of data different from the training data to decide which is the “best
pruned tree”
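As a hedged illustration of these two ideas (using scikit-learn's built-in options as one concrete realization, not the document's exact procedure; the data set and thresholds are arbitrary):

```python
# Sketch: prepruning vs. postpruning with scikit-learn (one possible realization).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_val, y_train, y_val = train_test_split(*load_iris(return_X_y=True), random_state=0)

# Prepruning: refuse any split whose impurity decrease falls below a threshold.
prepruned = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(X_train, y_train)

# Postpruning: grow the full tree, derive a sequence of progressively pruned trees
# (indexed by ccp_alpha), and keep the one that does best on data not used for training.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max((DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
            for a in path.ccp_alphas),
           key=lambda t: accuracy_score(y_val, t.predict(X_val)))
```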
Approaches to Determine the Final Tree Size
 Separate training (2/3) and testing (1/3) sets
 Use cross validation, e.g., 10-fold cross validation
Each run will result in a particular classification rate. For example, if we classify 50 of
100 test records correctly, the classification rate for that run is 50%.
You should choose the model that generates the highest classification rate. The final
classification rate for the model is the average of the ten classification rates.
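A minimal sketch of 10-fold cross validation (assuming scikit-learn; the data set is a placeholder), where the final rate is the average of the ten per-fold classification rates:

```python
# Sketch: estimating the classification rate with 10-fold cross validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rates = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)  # one rate per fold
print(rates)          # the ten per-run classification rates
print(rates.mean())   # final classification rate = average over the ten runs
```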
Why decision tree induction in data mining?
 relatively faster learning speed (than other classification methods)
 convertible to simple and easy to understand classification rules
 can use SQL queries for accessing databases
 comparable classification accuracy with other methods
 Bayesian Classification
A Bayesian classifier is defined by a set C of classes and a set A of attributes. A generic class
belonging to C is denoted by cj and a generic attribute belonging to A by Ai.
Consider a database D with a set of attribute values and the class label of the case.
The training of the Bayesian Classifier consists of the estimation of the conditional
probability distribution of each attribute, given the class.
Problem statement:
- Training data: examples of the form (d, h(d))
o where d are the data objects to classify (inputs)
o and h(d) is the correct class label for d, with h(d) ∈ {1, …, K}
- Goal: given dnew, provide h(dnew)
Bayes’ Rule:
P(h | d) = P(d | h) P(h) / P(d)
Understanding Bayes’ rule:
- d = data
- h = hypothesis (model)
- Rearranging: P(h | d) P(d) = P(d | h) P(h), i.e. P(d, h) = P(d, h), the same joint probability
on both sides
Who is who in Bayes’ rule:
- P(h): prior belief (probability of hypothesis h before seeing any data)
- P(d | h): likelihood (probability of the data if the hypothesis h is true)
- P(d) = Σ_h P(d | h) P(h): data evidence (marginal probability of the data)
- P(h | d): posterior (probability of hypothesis h after having seen the data d)
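A tiny numeric sketch of the rule (the numbers are made up purely for illustration):

```python
# Bayes' rule with made-up numbers: two hypotheses and one observed data item d.
prior = {"h1": 0.3, "h2": 0.7}                  # P(h)
likelihood = {"h1": 0.8, "h2": 0.1}             # P(d | h)
evidence = sum(likelihood[h] * prior[h] for h in prior)              # P(d) = sum_h P(d|h) P(h)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}  # P(h | d)
print(posterior)   # {'h1': ~0.774, 'h2': ~0.226}
```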
Naïve Bayes Classifier
What can we do if our data d has several attributes?
Naïve Bayes assumption: the attributes that describe data instances are conditionally
independent given the classification hypothesis:
P(d | h) = P(a1, …, aT | h) = Π_t P(at | h)
This is a simplifying assumption that may obviously be violated in reality; in spite of that, it
works well in practice.
The Bayesian classifier that uses the Naïve Bayes assumption and selects the most probable
(maximum a posteriori) hypothesis is called the Naïve Bayes classifier.
Successful applications:
 Medical Diagnosis
 Text classification
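A minimal counting-based sketch of this idea (illustrative only, with no smoothing and not tied to any particular library): estimate P(h) and each P(at | h) from labeled examples, then score a new instance by P(h) · Π_t P(at | h).

```python
# Minimal Naïve Bayes sketch for categorical attributes (illustrative only).
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_dict, class_label) pairs. Returns counts for scoring."""
    class_counts = Counter(label for _, label in examples)
    attr_counts = defaultdict(Counter)          # (attribute, label) -> Counter of values
    for attrs, label in examples:
        for attr, value in attrs.items():
            attr_counts[(attr, label)][value] += 1
    return class_counts, attr_counts

def score(model, attrs, label):
    """P(label) * prod_t P(a_t | label), with probabilities estimated by counting."""
    class_counts, attr_counts = model
    p = class_counts[label] / sum(class_counts.values())
    for attr, value in attrs.items():
        p *= attr_counts[(attr, label)][value] / class_counts[label]
    return p
```

Classifying a new instance then amounts to computing score(model, attrs, h) for every class h and picking the largest value.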
Example 1
The evidence E relates all attributes, without exception. The new instance to be classified is:

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

The probability of class “yes” given the evidence E is:

Pr[yes | E] = Pr[Outlook = Sunny | yes]
              × Pr[Temperature = Cool | yes]
              × Pr[Humidity = High | yes]
              × Pr[Windy = True | yes]
              × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
Counts and the corresponding fractions, per attribute value and class:

Outlook       Yes   No    Fractions (Yes, No)
Sunny          2     3    2/9, 3/5
Overcast       4     0    4/9, 0/5
Rainy          3     2    3/9, 2/5

Temperature   Yes   No    Fractions (Yes, No)
Hot            2     2    2/9, 2/5
Mild           4     2    4/9, 2/5
Cool           3     1    3/9, 1/5

Humidity      Yes   No    Fractions (Yes, No)
High           3     4    3/9, 4/5
Normal         6     1    6/9, 1/5

Windy         Yes   No    Fractions (Yes, No)
False          6     2    6/9, 2/5
True           3     3    3/9, 3/5
The training data (weather data set):

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

Play          Yes   No    Fractions (Yes, No)
               9     5    9/14, 5/14
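The computation above can be checked numerically from the fractions in the tables (a small sketch; Pr[E] cancels when the two classes are compared, so it is enough to normalize the two numerators):

```python
# Numeric check of Example 1 using the fractions read off the tables above.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # numerator of Pr[yes | E]
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # numerator of Pr[no | E]
print(round(p_yes, 4), round(p_no, 4))           # ~0.0053 vs ~0.0206
print(round(p_no / (p_yes + p_no), 3))           # Pr[no | E] ~ 0.795, so predict "no"
```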
Example 2:
1- Suppose that the training dataset exists as follows:

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
2- The training tuples are classified into two main categories:
 C1: buys_computer = ‘yes’
 C2: buys_computer = ‘no’
3- What is the class label of the following tuple?
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Solution:
Compute P(X|Ci) for each class Ci:
P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci): P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(Ci): P(buys_computer=“yes”) = 9/14 = 0.643 and P(buys_computer=“no”) = 5/14 = 0.357
P(X|Ci)*P(Ci): P(X|buys_computer=“yes”) * P(buys_computer=“yes”) = 0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”) = 0.007
X belongs to class “buys_computer=yes”
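A quick numeric check of this solution (a sketch; the priors 9/14 and 5/14 follow from the 9 “yes” and 5 “no” rows of the training data):

```python
# Numeric check of Example 2.
p_x_yes = (2/9) * (4/9) * (6/9) * (6/9)      # P(X | buys_computer = "yes")
p_x_no  = (3/5) * (2/5) * (1/5) * (2/5)      # P(X | buys_computer = "no")
print(round(p_x_yes, 3), round(p_x_no, 3))               # 0.044  0.019
print(round(p_x_yes * 9/14, 3), round(p_x_no * 5/14, 3)) # 0.028  0.007
# The larger product belongs to "yes", so X is classified as buys_computer = "yes".
```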