Download classification - Computing and Information Studies

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

K-nearest neighbors algorithm wikipedia , lookup

Transcript
CLASSIFICATION
An important premise of data mining is that databases are rich with hidden information which could
be useful for decision making. The information we seek can be in many different forms and could be
used for a verity of decision making tasks. An important type of decision making task is called pattern
recognition. Pattern recognition includes a wide range of problems of practical significance; from
speech recognition to classification of postal addresses characters. The pattern recognition tasks can
be of two types: classification and prediction. Given a set of values about an object or phenomenon,
classification involves classifying it into one of two or more classes. Prediction involves predicting a
value for an unknown variable given the values of some know variables. Here we provide a brief
description of the classification task.
1. Classification model
A classification model is built form available data using the known values of the variables and
the known class. The available data is called training set and class value is called class label.
Constructing the classification model is called supervised learning. Then, given some
previously unseen data about an object or phenomenon whose class label is not known, we
use the classification model to determine its class.
There are many reasons why we may wish to set up a classification procedure or develop a
classification model.
 Such a procedure may be much faster than humans (postal code reading).
 Procedure needs to be unbiased (credit applications where humans may have bias
 Accuracy (medical diagnosis).
 Supervisor may be a verdict of experts used for developing the classification model.
2. Classifier Issues
Some issues that must be considering in developing a classifier are listed below:
Accuracy: represented as proportion of correct classifications. However, some errors may be
more serious than others and it may be necessary to control different error rates.
Speed: the speed of classification is important in some applications (real time control
systems). Sometimes there is a tradeoff between accuracy and speed.
Comprehensibility of classification model especially when humans are involved in decision
making (medical diagnosis).
Time to learn: the classification model from the training data, e.g. in a battlefield where
correct and fast classification is necessary based on available data.
3. Classification by Decision Tree Induction
A decision tree or classification tree is a tree structure, where root node is an attribute (or
feature). Internal nodes represent attributes and branches represent attribute values. The leaf
nodes are classes. A path in a tree is a traversal from the root to a leaf node and represents a
decision predicate.
Steps in developing a decision tree classifier.
i. Start with all data – if all data belong to one class, label it as a leaf node with that class.
ii. If not, use some attribute selection criterion. Suppose we use information gain. Compute
the information gain for each attribute and select that attribute which has the highest
information gain. This becomes the root node.
iii. Divide the data according to the attribute values of the root node.
iv. Repeat the above for each branch (for the data at that branch) from the root node.
v. Stop the process according to following rules:
--When all data at a node belong to one class. Then it is the label of that node.
--When all the attributes have been used, i.e., when no more attributes are available for
further partitioning. Use majority criterion for determining the leaf label. Sometimes other
criteria may be more appropriate for determining g the class label. Since such a node is not pure,
the node has a classification error. The totality of such errors is the total classification error of
the tree.
Classification accuracy
The C4.5 algorithm has functions to determine the classification accuracy of the constructed
tree. The accuracy figures are generated by the algorithm based on training data and provide
accuracy for current samples as well as for test data.
4. Tree size
When constructing a decision tree, it may contain too many branches; some of them lead to
poor classification accuracy. The C4.5 algorithm has functions for pruning the tree (reducing the
total number of nodes in the tree) and the algorithm also provides the performance of the
original and pruned trees.
5. Classification Rules form Decision Tree
The C4.5 algorithm can also generate classification rules. It re‐expresses a classification model
as production rules, a format that seems to be more intelligible that a tree. The left hand side
represents attributes values and the right hand is the classification of decision results.
6. Key Requirement for C4.5 Approach ( from Quinlin’s book)
Attribute value description
Data must be a flat file; all information about an object must be expressible in terms of fixed
collection of properties or attributes. Each attributes may be discrete or numeric but each object
must have the same attributes; attributes should not vary from object to object.
Predefined class
Categories to which cases are to be assigned must be prespecified.
Discrete Classes
Classes must be sharply delineated; a case does or does not belong to a class. For example,
cannot use C4.5 to predict temperature if categories are labeled as hard, quite hard, flexible,
soft.
Sufficient Data
Inductive generalization proceeds by identifying patterns in data. This approach fails if robust
patterns cannot distinguish from chance occurrences (of classes). Since such differentiation
depends on statistical tests, there must be sufficient data to allow these tests to be effective.
Logical Classification Model:
C4.5 constructs only classifier that can be expressed as decision trees or set of production rules.