Classification

Table of contents:
- Statistical classification
- Classification Processes
- Classification vs. Prediction
- Major Classification Models
- Evaluating Classification Methods
- Classification by decision tree induction
- Bayesian Classification

Statistical classification
Statistical classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc.) and based on a training set of previously labeled items. Formally, the problem can be stated as follows: given training data {(x1, y1), ..., (xn, yn)}, produce a classifier h : X → Y which maps an object x ∈ X to its classification label y ∈ Y. For example, if the problem is filtering spam, then x is some representation of an email and y is either "Spam" or "Non-Spam". Statistical classification algorithms are typically used in pattern recognition systems.

Classification: A Two-Step Process
Classification creates a GLOBAL model that is used for PREDICTING the class label of unknown data. The predicted class label is a CATEGORICAL attribute. Classification is clearly useful in many decision problems where, for a given data item, a decision is to be made that depends on the class to which the data item belongs.
Typical applications:
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis

Classification steps:
a) Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
b) Model usage: classifying future or unknown objects
- The known label of a test sample is compared with the classified result from the model
- The test set is independent of the training set
- If the accuracy is acceptable, the model is used to classify data tuples whose class labels are not known

In order to build a global model for classification, a training set is needed from which the model can be derived. There are many possible models for classification, which can be expressed as rules, decision trees or mathematical formulae. Once the model is built, unknown data can be classified. To test the quality of the model, its accuracy can be measured on a test set. If a certain set of data is available for building a classifier, one normally splits this set into a larger part, which is the training set, and a smaller part, which is the test set.

Example: In classification the classes are known and given by so-called class label attributes. For the given data collection, TENURED would be the class label attribute. The goal of classification is to determine rules on the other attributes that allow predicting the class label attribute. To determine the quality of the rules derived from the training set, the test set is used. We see that the classifier that has been found is correct in 75% of the cases. If the rules are of sufficient quality, they are used to classify data that has not been seen before. Since the reliability of the rule has been evaluated as 75% by testing it against the test set, and assuming that the test set is a representative sample of all data, the reliability of the rule applied to unseen data should be the same.
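As an illustration of this two-step process, the following is a minimal sketch in Python, assuming the scikit-learn library is available; its bundled Iris data set stands in for an arbitrary labeled data collection.

```python
# A minimal sketch of the two-step classification process, assuming scikit-learn.
# The bundled Iris data stands in for any labeled data collection.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step a) Model construction: learn a classifier from the (larger) training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step b) Model usage: check accuracy on the independent test set; if it is
# acceptable, use the model to classify tuples whose class labels are unknown.
print("test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Splitting off a test set before training is what makes the reported accuracy an estimate of how the model will behave on unseen data.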
Major Classification Models
- Classification by decision tree induction
- Bayesian Classification
- Neural Networks
- Support Vector Machines (SVM)
- Classification Based on Associations
- Other Classification Methods
  o KNN
  o Boosting
  o Bagging
  o …

Evaluating Classification Methods
- Predictive accuracy
- Speed
  o time to construct the model
  o time to use the model
- Robustness
  o handling noise and missing values
- Scalability
  o efficiency in disk-resident databases
- Goodness of rules
  o decision tree size
  o compactness of classification rules

Classification by Decision Tree Induction
Decision tree
– A flow-chart-like tree structure
– An internal node denotes a test on a single attribute
– A branch represents an outcome of the test
– Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases
– Tree construction
  o At the start, all the training samples are at the root
  o Partition the samples recursively based on selected attributes
– Tree pruning
  o Identify and remove branches that reflect noise or outliers
Use of a decision tree: classifying an unknown sample
– Test the attribute values of the sample against the decision tree

At each node, a decision tree splits the data set into smaller partitions, based on a test predicate that is applied to one of the attributes in the tuples. Each leaf of the decision tree is then associated with one specific class label. Generally, a decision tree is first constructed in a top-down manner by recursively splitting the training set using conditions on the attributes. How these conditions are found is one of the key issues of decision tree induction. After tree construction it is usually the case that at the leaf level the granularity is too fine, i.e. many leaves represent some kind of exceptional data. Thus, in a second phase, such leaves are identified and eliminated.

Using the decision tree classifier is straightforward: the attribute values of an unknown sample are tested against the conditions in the tree nodes, and the class is derived from the class of the leaf node at which the sample arrives. A standard approach to representing classification rules is a decision tree. In a decision tree, at each level one of the existing attributes is used to partition the data set based on the attribute value. At the leaf level of the classification tree the values of the class label attribute are found. Thus, for a given data item with an unknown class label attribute, its class can be determined by traversing the tree from the root to a leaf. Note that in different branches of the tree, different attributes may be used for classification. The key problem of finding classification rules is thus to determine the attributes that are used to partition the data set at each level of the decision tree.

Algorithm for Decision Tree Construction
The basic algorithm for decision tree induction proceeds in a greedy manner. First, all samples are at the root. Among the attributes, one is chosen to partition the set. The criterion applied to select the attribute is based on measuring the information gain that can be achieved, i.e. how much uncertainty about the classification of the samples is removed by the partitioning.
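The sketch below outlines this greedy, top-down construction for categorical attributes; it is an illustration only, assuming samples are represented as Python dictionaries, and select_attribute is a placeholder for the attribute-selection measure (e.g. information gain) discussed in the next section.

```python
# Sketch of greedy, top-down decision tree construction for categorical attributes.
# Samples are dicts; select_attribute is a placeholder for the selection measure.
from collections import Counter

def majority_class(samples, target):
    """Majority voting over the class label attribute."""
    return Counter(s[target] for s in samples).most_common(1)[0][0]

def build_tree(samples, attributes, target, select_attribute):
    labels = {s[target] for s in samples}
    if len(labels) == 1:                        # all samples belong to the same class
        return labels.pop()
    if not attributes:                          # no attributes left for partitioning
        return majority_class(samples, target)
    best = select_attribute(samples, attributes, target)
    node = {"attr": best, "branches": {}}
    for value in {s[best] for s in samples}:    # one branch per outcome of the test
        subset = [s for s in samples if s[best] == value]
        rest = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, rest, target, select_attribute)
    return node
```

The sketch stops when a node is pure or when no attributes remain, in which case majority voting assigns the leaf label; a subsequent pruning phase, not shown here, would remove leaves that only reflect noise or outliers.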
• Basic algorithm for categorical attributes (greedy)
– The tree is constructed in a top-down recursive divide-and-conquer manner
– At the start, all the training samples are at the root
– Examples are partitioned recursively based on test attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
– There are no samples left
• Attribute Selection Measure
– Information Gain (ID3/C4.5)

Attribute Selection Measure: Information Gain (ID3/C4.5)
Here we summarize the basic idea of how split attributes are found during the construction of a decision tree. It is based on an information-theoretic argument. Assuming that we have a binary category, i.e. two classes P and N into which a data collection S needs to be classified, we can compute the amount of information required to determine the class by I(p, n), the standard entropy measure, where p and n denote the cardinalities of P and N:
I(p, n) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
Given an attribute A that can be used to partition the data collection further in the decision tree, we can calculate the amount of information needed to classify the data after the split according to attribute A has been performed. This value is obtained by calculating I(pi, ni) for each of the partitions and weighting these values by the probability that a data item belongs to the respective partition:
E(A) = Σi ((pi + ni)/(p + n)) · I(pi, ni)
The information gained by a split can then be determined as the difference between the amount of information needed for correct classification before and after the split, Gain(A) = I(p, n) − E(A). Thus we calculate the reduction in uncertainty that is obtained by splitting according to attribute A and select, among all possible attributes, the one that leads to the highest reduction. In the following we illustrate these calculations for the weather data example.

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"

Avoid Overfitting in Classification
Overfitting occurs when the generated model fits the training data too closely and does not apply to new incoming data. Typical causes:
» the training data is too small and does not cover many cases
» wrong assumptions
Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
  o It is difficult to choose an appropriate threshold
- Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
  o Use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
Each run results in a particular classification rate; for example, if we classified 50/100 of the test records correctly, the classification rate for that run is 50%. The final classification rate for a model is the average of the ten classification rates, and you should choose the model that yields the highest classification rate.
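As a worked illustration of the attribute selection measure, the sketch below computes I(p, n) and the information gain of a split; the example counts are the per-class counts of the Outlook attribute in the weather data of Example 1 further below (9 "yes" and 5 "no" samples overall).

```python
# Entropy I(p, n) and information gain of a split, as defined above.
from math import log2

def entropy(p, n):
    """I(p, n): information needed to classify p positive and n negative samples."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)

def information_gain(p, n, partitions):
    """Gain(A) = I(p, n) - weighted sum of I(p_i, n_i) over the partitions of A."""
    after = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in partitions)
    return entropy(p, n) - after

# Outlook splits the 14 samples into Sunny (2 yes, 3 no), Overcast (4, 0), Rainy (3, 2).
print(round(information_gain(9, 5, [(2, 3), (4, 0), (3, 2)]), 3))   # 0.247
```

Computing the gain of every candidate attribute in this way and choosing the largest one is exactly the selection step of the greedy construction sketched earlier.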
Why decision tree induction in data mining?
- relatively fast learning speed (compared to other classification methods)
- convertible to simple and easy to understand classification rules
- can use SQL queries for accessing databases
- comparable classification accuracy with other methods

Bayesian Classification
A Bayesian classifier is defined by a set C of classes and a set A of attributes. A generic class belonging to C is denoted by cj, and a generic attribute belonging to A by Ai. Consider a database D with a set of attribute values and the class label of each case. The training of the Bayesian classifier consists of the estimation of the conditional probability distribution of each attribute, given the class.

Problem statement:
- Training data: examples of the form (d, h(d))
  o where d are the data objects to classify (inputs)
  o and h(d) is the correct class label for d, h(d) ∈ {1, …, K}
- Goal: given dnew, provide h(dnew)

Bayes' Rule:
P(h | d) = P(d | h) P(h) / P(d)

Understanding Bayes' rule:
- d = data, h = hypothesis (model)
- The rule follows by rearranging P(h | d) P(d) = P(d | h) P(h) = P(d, h), the same joint probability on both sides.

Who is who in Bayes' rule:
- P(h): prior belief (probability of hypothesis h before seeing any data)
- P(d | h): likelihood (probability of the data if the hypothesis h is true)
- P(d) = Σh P(d | h) P(h): data evidence (marginal probability of the data)
- P(h | d): posterior (probability of hypothesis h after having seen the data d)

Naïve Bayes Classifier
What can we do if our data d has several attributes? The Naïve Bayes assumption: the attributes that describe data instances are conditionally independent given the classification hypothesis,
P(d | h) = P(a1, …, aT | h) = ∏t P(at | h)
This is a simplifying assumption, and it may obviously be violated in reality; in spite of that, it works well in practice. The Bayesian classifier that uses the Naïve Bayes assumption and computes the most probable hypothesis is called the Naïve Bayes classifier.
Successful applications: medical diagnosis, text classification.
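The following is a minimal sketch of this idea, assuming records are represented as Python dictionaries; the function names train and classify are illustrative. Class priors P(h) and per-attribute likelihoods P(at | h) are estimated as relative frequencies, and a new sample gets the class with the largest unnormalized posterior, since the evidence P(d) is the same for every class and can be dropped.

```python
# Minimal Naive Bayes sketch: estimate P(h) and P(a_t | h) by relative
# frequencies from labeled records (dicts), then pick the class with the
# largest unnormalized posterior P(h) * prod_t P(a_t | h).
from collections import Counter, defaultdict

def train(records, target):
    priors = Counter(r[target] for r in records)      # class counts
    counts = defaultdict(Counter)                     # (class, attribute) -> value counts
    for r in records:
        for attr, value in r.items():
            if attr != target:
                counts[(r[target], attr)][value] += 1
    return priors, counts, len(records)

def classify(sample, priors, counts, n):
    scores = {}
    for h, class_count in priors.items():
        score = class_count / n                       # prior P(h)
        for attr, value in sample.items():            # naive independence assumption
            score *= counts[(h, attr)][value] / class_count   # likelihood P(a_t | h)
        scores[h] = score
    return max(scores, key=scores.get)
```

In practice a small (Laplace) correction is usually added to the counts so that a single zero frequency does not force the whole product to zero.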
Example 1
The evidence E comprises the values of all attributes of the new instance, without exception:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Probability of class "yes":
Pr[yes | E] = Pr[Outlook = Sunny | yes] × Pr[Temperature = Cool | yes] × Pr[Humidity = High | yes] × Pr[Windy = True | yes] × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E] ≈ 0.0053 / Pr[E]

Counts and relative frequencies from the weather data:

Outlook:      Sunny     Yes 2 (2/9)   No 3 (3/5)
              Overcast  Yes 4 (4/9)   No 0 (0/5)
              Rainy     Yes 3 (3/9)   No 2 (2/5)
Temperature:  Hot       Yes 2 (2/9)   No 2 (2/5)
              Mild      Yes 4 (4/9)   No 2 (2/5)
              Cool      Yes 3 (3/9)   No 1 (1/5)
Humidity:     High      Yes 3 (3/9)   No 4 (4/5)
              Normal    Yes 6 (6/9)   No 1 (1/5)
Windy:        False     Yes 6 (6/9)   No 2 (2/5)
              True      Yes 3 (3/9)   No 3 (3/5)
Play:         Yes 9 (9/14)   No 5 (5/14)

The weather data:

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Example 2:
1- Suppose that the training dataset is as follows:

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

2- The training tuples are classified into two main categories:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

3- What is the class label of the following tuple?
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Solution:
Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

For X = (age<=30, income=medium, student=yes, credit_rating=fair):
P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

With the class priors P(buys_computer = "yes") = 9/14 = 0.643 and P(buys_computer = "no") = 5/14 = 0.357:
P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007

Since 0.028 > 0.007, X belongs to the class buys_computer = "yes".
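As a cross-check of the hand computation in Example 2, the short sketch below multiplies the estimated conditional probabilities and the class prior for each class and keeps the larger score; the dictionary keys are shorthand labels for the attribute tests.

```python
# Cross-check of Example 2: score(h) = P(h) * prod of P(attribute value | h).
cond = {
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
priors = {"yes": 9/14, "no": 5/14}

scores = {}
for label, probs in cond.items():
    score = priors[label]
    for p in probs.values():
        score *= p                      # naive independence assumption
    scores[label] = score

print({k: round(v, 3) for k, v in scores.items()})        # {'yes': 0.028, 'no': 0.007}
print("X belongs to class buys_computer =", max(scores, key=scores.get))
```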