Department of MCA Data Mining & Warehousing-CH-5 Notes KNS Institute of Technology

Chapter 5: Classification

5.1 CLASSIFICATION:

Definition: Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y. The target function is also known informally as a classification model.

Purpose of a Classification Model:

1) Descriptive Modeling: A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful, for both biologists and others, to have a descriptive model that summarizes the data and explains which features characterize mammals, reptiles, fishes, or amphibians.

Name       | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
Human      | warm-blooded     | Hair       | Yes         | no               | no              | yes      | no         | mammal
Python     | cold-blooded     | Scales     | No          | no               | no              | no       | yes        | reptile
Salmon     | cold-blooded     | Scales     | No          | yes              | no              | no       | no         | fish
Whale      | warm-blooded     | Hair       | Yes         | yes              | no              | no       | no         | mammal
Frog       | cold-blooded     | None       | No          | semi             | no              | yes      | yes        | amphibian
Komodo     | cold-blooded     | Scales     | No          | no               | no              | yes      | no         | reptile
Eel        | cold-blooded     | Scales     | No          | yes              | no              | no       | no         | fish
Salamander | cold-blooded     | None       | No          | semi             | no              | yes      | yes        | amphibian

Table 4.1: The vertebrate data set

2) Predictive Modeling: A classification model can also be used to predict the class label of unknown records. As shown in Figure 4.2, a classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record. Suppose we are given the following characteristics of a creature known as a gila monster:

Name         | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
Gila monster | cold-blooded     | Scales     | no          | no               | no              | yes      | yes        | ?

We can use a classification model built from the data set shown in Table 4.1 to determine the class to which the creature belongs.
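As a toy illustration of the "black box" view, the model below is a hand-written stand-in for a learned classifier (the rules here are hypothetical, not produced by any training algorithm); presented with the gila monster's attribute set, it returns a class label:

```python
# A classification model is just a function f mapping an attribute set x
# to a class label y. These hand-written rules are illustrative only,
# loosely mimicking the regularities visible in Table 4.1.
def classify(record):
    if record["Gives Birth"] == "yes" and record["Body Temperature"] == "warm-blooded":
        return "mammal"
    if record["Body Temperature"] == "cold-blooded" and record["Skin Cover"] == "Scales":
        if record["Aquatic Creature"] == "yes":
            return "fish"
        return "reptile"
    return "amphibian"

gila_monster = {"Body Temperature": "cold-blooded", "Skin Cover": "Scales",
                "Gives Birth": "no", "Aquatic Creature": "no",
                "Aerial Creature": "no", "Has Legs": "yes", "Hibernates": "yes"}
print(classify(gila_monster))  # reptile
```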
Classification techniques are best suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories.

Lecturer: Syed Khutubuddin Ahmed Contact: [email protected]

5.2 GENERAL APPROACH FOR SOLVING A CLASSIFICATION PROBLEM:

o A classification technique (or classifier) is a systematic approach to building classification models from an input data set.
o The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before.
o Therefore, a key objective of the learning algorithm is to build models with good generalization capability; i.e., models that accurately predict the class labels of previously unknown records.
o Figure 4.3 shows a general approach for solving classification problems. First, a training set consisting of records whose class labels are known must be provided. The training set is used to build a classification model, which is subsequently applied to the test set, which consists of records with unknown class labels.
o Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a confusion matrix.

Table 4.2. Confusion matrix for a 2-class problem

                          Predicted Class
                          Class = 1   Class = 0
Actual Class  Class = 1   f11         f10
              Class = 0   f01         f00

o Table 4.2 depicts the confusion matrix for a binary classification problem. Each entry fij in this table denotes the number of records from class i predicted to be of class j. For instance, f01 is the number of records from class 0 incorrectly predicted as class 1.
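The bookkeeping behind Table 4.2 can be sketched as follows (a minimal Python sketch; the example label sequences are made up for illustration):

```python
# Tally a 2x2 confusion matrix from actual and predicted labels.
# f[i][j] = number of records of actual class i predicted as class j.
def confusion_matrix(actual, predicted):
    f = {1: {1: 0, 0: 0}, 0: {1: 0, 0: 0}}
    for a, p in zip(actual, predicted):
        f[a][p] += 1
    return f

actual    = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical test-set labels
predicted = [1, 0, 1, 0, 0, 1, 0, 1]   # hypothetical model outputs
f = confusion_matrix(actual, predicted)
print(f[1][1], f[1][0], f[0][1], f[0][0])  # 3 1 1 3
```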
o Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the total number of incorrect predictions is (f10 + f01).
o Performance metrics such as accuracy and error rate summarize this information in a single number and can be used to compare different models:
  Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
  Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)

5.3 DECISION TREE INDUCTION:

The decision tree is a widely used classification technique.

5.3.1 HOW A DECISION TREE WORKS:

o Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-mammal?
o One approach is to pose a series of questions about the characteristics of the species.
o The first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal.
o Otherwise, it is either a bird or a mammal.
o In the latter case, we need to ask a follow-up question: Do the females of the species give birth to their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals.
o The series of questions and their possible answers can be organized in the form of a decision tree, which is a hierarchical structure consisting of nodes and directed edges. Figure 4.4 shows the decision tree for the mammal classification problem.
o The tree has three types of nodes:
  - A root node that has no incoming edges and zero or more outgoing edges.
  - Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
  - Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
o Classifying a test record is straightforward once a decision tree has been constructed.
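This root-to-leaf classification procedure can be sketched as follows; the nested-dict tree encoding is an assumption of this sketch, not something the text prescribes:

```python
# A decision tree as nested dicts (assumed representation): internal
# nodes test one attribute, leaves hold a class label.
tree = {"attr": "Body Temperature",
        "branches": {"cold-blooded": {"label": "non-mammal"},
                     "warm-blooded": {"attr": "Gives Birth",
                                      "branches": {"yes": {"label": "mammal"},
                                                   "no": {"label": "non-mammal"}}}}}

def classify(tree, record):
    node = tree
    while "label" not in node:                 # descend until a leaf is reached
        node = node["branches"][record[node["attr"]]]
    return node["label"]

print(classify(tree, {"Body Temperature": "warm-blooded", "Gives Birth": "yes"}))  # mammal
```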
Starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test, as shown in Figure 4.5.

5.3.2 HOW TO BUILD A DECISION TREE: HUNT'S ALGORITHM:

o In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. Let Dt be the set of training records that are associated with node t and y = {y1, y2, ..., yc} denote the class labels. The following is a recursive definition of Hunt's algorithm:

Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labelled as yt.
Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node.

o In Figure 4.6, each record contains the personal information of a borrower along with a class label indicating whether the borrower has defaulted on loan payments.
o The initial tree for the classification problem contains a single node with class label Defaulted = No (see Figure 4.7(a)), which means that most of the borrowers successfully repaid their loans. The tree, however, needs to be refined since the root node contains records from both classes.
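Hunt's two-step recursive definition can be sketched in Python; for simplicity this sketch splits on attributes in a fixed order rather than searching for the best test condition, and labels impure leaves with the majority class (both simplifications are assumptions of the sketch):

```python
from collections import Counter

# Records are (attribute-dict, label) pairs. Step 1: a pure node (or one
# with no attributes left) becomes a leaf. Step 2: otherwise pick a test
# attribute, distribute records to children by outcome, and recurse.
def hunt(records, attrs):
    labels = [y for _, y in records]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf label
    a, rest = attrs[0], attrs[1:]                     # fixed-order split choice
    children = {}
    for x, y in records:
        children.setdefault(x[a], []).append((x, y))  # distribute by outcome
    return {a: {v: hunt(sub, rest) for v, sub in children.items()}}

# Tiny made-up slice of a loan data set, for illustration only.
data = [({"Home Owner": "yes"}, "No"),
        ({"Home Owner": "no"}, "No"),
        ({"Home Owner": "no"}, "Yes")]
model = hunt(data, ["Home Owner"])
print(model)
```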
The records are subsequently divided into smaller subsets based on the outcomes of the Home Owner test condition.
o Hunt's algorithm is then applied recursively to each child of the root node.
o The left child of the root is therefore a leaf node labelled Defaulted = No (see Figure 4.7(b)). For the right child, we need to continue applying the recursive step of Hunt's algorithm until all the records belong to the same class.

Design Issues of Decision Tree Induction:

A learning algorithm for inducing decision trees must address the following two issues:
1. How should the training records be split? The algorithm must select an attribute test condition to divide the records into smaller subsets. To implement this step, it must provide:
   i) a method for specifying the test condition for different attribute types, and
   ii) an objective measure for evaluating the goodness of each test condition.
2. How should the splitting procedure stop?
   i) A stopping condition is needed to terminate the tree-growing process.
   ii) One strategy is to continue expanding a node until either all the records belong to the same class or all the records have identical attribute values.
   iii) Other criteria can be imposed to allow the tree-growing procedure to terminate earlier.

5.3.3 METHODS FOR EXPRESSING ATTRIBUTE TEST CONDITIONS:

o A decision tree induction algorithm must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.
o Binary Attributes: The test condition for a binary attribute generates two potential outcomes, as shown in Figure 4.8.
o Nominal Attributes: Since a nominal attribute can have many values, its test condition can be expressed in two ways, as shown in Figure 4.9.
o For a multiway split (Figure 4.9(a)), the number of outcomes depends on the number of distinct values of the corresponding attribute.
o For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split.
o Ordinal Attributes: Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values.
o Figure 4.10 illustrates various ways of splitting training records based on the Shirt Size attribute. The groupings shown in Figures 4.10(a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure 4.10(c) violates this property because it combines the attribute values Small and Large into one partition while Medium and Extra Large are combined into another.
o Continuous Attributes: For continuous attributes, the test condition can be expressed as a comparison test (A < v) or (A >= v) with binary outcomes, or as a range query with outcomes of the form vi <= A < vi+1, for i = 1, ..., k.

5.3.4 MEASURES FOR SELECTING THE BEST SPLIT:

o There are many measures that can be used to determine the best way to split the records. These measures are defined in terms of the class distribution of the records before and after splitting.
o Let p(i|t) denote the fraction of records belonging to class i at a given node t.
o The measures developed for selecting the best split are often based on the degree of impurity of the child nodes.
o The smaller the degree of impurity, the more skewed the class distribution.
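The standard impurity measures (entropy, the Gini index, and classification error) can be computed directly from the class fractions p(i|t) at a node:

```python
from math import log2

# The three impurity measures over a class distribution p = [p(1|t), p(2|t), ...].
def entropy(p):
    return sum(-pi * log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi * pi for pi in p)

def classification_error(p):
    return 1 - max(p)

p = [0.5, 0.5]        # uniform class distribution: maximum impurity
print(entropy(p), gini(p), classification_error(p))   # 1.0 0.5 0.5
p = [1.0, 0.0]        # pure node: zero impurity
print(entropy(p), gini(p), classification_error(p))   # 0.0 0.0 0.0
```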
o Examples of impurity measures include entropy, the Gini index, and classification error.

5.3.5 ALGORITHM FOR DECISION TREE INDUCTION:

o A skeleton decision tree induction algorithm called TreeGrowth is shown in Algorithm 4.1. The input to this algorithm consists of the training records E and the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the leaf nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1).

Algorithm 4.1 A skeleton decision tree induction algorithm.

TreeGrowth(E, F)
 1. if stopping_cond(E, F) = true then
 2.   leaf = createNode()
 3.   leaf.label = Classify(E)
 4.   return leaf
 5. else
 6.   root = createNode()
 7.   root.test_cond = find_best_split(E, F)
 8.   let V = {v | v is a possible outcome of root.test_cond}
 9.   for each v in V do
10.     Ev = {e | root.test_cond(e) = v and e in E}
11.     child = TreeGrowth(Ev, F)
12.     add child as a descendant of root and label the edge (root -> child) as v
13.   end for
14. end if
15. return root

The details of the algorithm are explained below:
1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as node.label.
2. The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records.
3. The Classify() function determines the class label to be assigned to a leaf node.
4.
The stopping_cond() function is used to terminate the tree-growing process by testing whether all the records have either the same class label or the same attribute values. Another way to terminate the recursion is to test whether the number of records has fallen below some minimum threshold.
5. After building the decision tree, a tree-pruning step can be performed to reduce its size.

5.3.6 AN EXAMPLE: WEB ROBOT DETECTION

o Web usage mining is the task of applying data mining techniques to extract useful patterns from Web access logs.
o These patterns can reveal interesting characteristics of site visitors; e.g., people who repeatedly visit a Web site and view the same product description page are more likely to buy the product if certain incentives such as rebates or free shipping are offered.
o In Web usage mining, it is important to distinguish accesses made by human users from those made by Web robots.
o A Web robot (also known as a Web crawler) is a software program that automatically locates and retrieves information from the Internet by following the hyperlinks embedded in Web pages. Decision tree classification can be used to distinguish between accesses by human users and those by Web robots.
o To classify the Web sessions, features are constructed to describe the characteristics of each session.
o Some of the features used for the Web robot detection task are depth and breadth.
o Depth determines the maximum distance of a requested page, where distance is measured in terms of the number of hyperlinks away from the entry point of the Web site.
o For example, the home page http://www.cs.umn.edu/kumar is assumed to be at depth 0, whereas http://www.cs.umn.edu/kumar/MINDS/MINDS_papers.htm is located at depth 2.
o The breadth attribute measures the width of the corresponding Web graph.
o Here, sessions due to Web robots belong to class 1, and sessions due to human users belong to class 0.

The model suggests that Web robots can be distinguished from human users in the following way:
1. Accesses by Web robots tend to be broad but shallow, whereas accesses by human users tend to be more focused (narrow but deep).
2. Unlike human users, Web robots seldom retrieve the image pages associated with a Web document.
3. Sessions due to Web robots tend to be long and contain a large number of requested pages.
4. Web robots are more likely to make repeated requests for the same document, since the Web pages retrieved by human users are often cached by the browser.

Decision Tree (partial):
depth = 1:
|  breadth > 7: class 1
|  breadth <= 7:
|  |  breadth <= 3:
|  |  |  imagePages > 0.375: class 0
|  |  |  imagePages <= 0.375:
|  |  |  |  totalPages <= 6: class 1
...

Figure 4.18 Decision tree model for Web robot detection

5.3.7 CHARACTERISTICS OF DECISION TREE INDUCTION:

1. Decision tree induction is a non-parametric approach for building classification models.
2. Finding an optimal decision tree is an NP-complete problem. Many decision tree algorithms therefore employ a heuristic-based approach to guide their search in the vast hypothesis space.
3. Techniques developed for constructing decision trees are computationally inexpensive, making it possible to quickly construct models even when the training set size is very large.
4. Decision trees, especially smaller-sized trees, are relatively easy to interpret.
The accuracies of the trees are also comparable to other classification techniques for many simple data sets.
5. Decision trees provide an expressive representation for learning discrete-valued functions.
6. Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding over-fitting are employed.
7. The presence of redundant attributes does not adversely affect the accuracy of decision trees.
8. Since most decision tree algorithms employ a top-down, recursive partitioning approach, the number of records becomes smaller as we traverse down the tree.
9. A sub-tree can be replicated multiple times in a decision tree, as shown in Figure 4.19; this is a consequence of the divide-and-conquer strategy, which solves each sub-tree independently.
10. Studies have shown that the choice of impurity measure has little effect on the performance of decision tree induction algorithms.

5.4 RULE-BASED CLASSIFIER:

A rule-based classifier is a technique for classifying records using a collection of "if ... then ..." rules.
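One minimal way to hold such if-then rules in code is as (antecedent, consequent) pairs; the dict-based representation below is an assumption of this sketch:

```python
# A rule's antecedent is a dict of attribute tests; a rule covers a
# record when every test in the antecedent is satisfied.
rules = [
    ({"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),
    ({"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),
    ({"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),
]

def covers(antecedent, record):
    return all(record.get(a) == v for a, v in antecedent.items())

hawk = {"Blood Type": "warm", "Give Birth": "no",
        "Can Fly": "yes", "Live in Water": "no"}
print([y for cond, y in rules if covers(cond, hawk)])  # ['Birds']
```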
Rule-Based Classifier (Example):

Name          | Blood Type | Give Birth | Can Fly | Live in Water | Class
human         | warm       | yes        | no      | no            | mammals
python        | cold       | no         | no      | no            | reptiles
salmon        | cold       | no         | no      | yes           | fishes
whale         | warm       | yes        | no      | yes           | mammals
frog          | cold       | no         | no      | sometimes     | amphibians
komodo        | cold       | no         | no      | no            | reptiles
bat           | warm       | yes        | yes     | no            | mammals
pigeon        | warm       | no         | yes     | no            | birds
cat           | warm       | yes        | no      | no            | mammals
leopard shark | cold       | yes        | no      | yes           | fishes
turtle        | cold       | no         | no      | sometimes     | reptiles
penguin       | warm       | no         | no      | sometimes     | birds
porcupine     | warm       | yes        | no      | no            | mammals
eel           | cold       | no         | no      | yes           | fishes
salamander    | cold       | no         | no      | sometimes     | amphibians
gila monster  | cold       | no         | no      | no            | reptiles
platypus      | warm       | no         | no      | no            | mammals
owl           | warm       | no         | yes     | no            | birds
dolphin       | warm       | yes        | no      | yes           | mammals
eagle         | warm       | no         | yes     | no            | birds

The table above shows an example of a model generated by a rule-based classifier for the vertebrate classification problem.

Example of a rule set for the vertebrate classification problem:
R1: (Give Birth = no) AND (Can Fly = yes) ==> Birds
R2: (Give Birth = no) AND (Live in Water = yes) ==> Fishes
R3: (Give Birth = yes) AND (Blood Type = warm) ==> Mammals
R4: (Give Birth = no) AND (Can Fly = no) ==> Reptiles
R5: (Live in Water = sometimes) ==> Amphibians

A rule has the form (Condition) ==> y
- where Condition is a conjunction of attribute tests and y is the class label
- LHS: rule antecedent or condition
- RHS: rule consequent
- Examples of classification rules:
  (Blood Type = warm) AND (Lay Eggs = yes) ==> Birds
  (Taxable Income < 50K) AND (Refund = Yes) ==> Evade = No

Application of a Rule-Based Classifier:

A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.

Name         | Blood Type | Give Birth | Can Fly | Live in Water | Class
hawk         | warm       | no         | yes     | no            | ?
grizzly bear | warm       | yes        | no      | no            | ?

The rule R1 covers the hawk => Bird
The rule R3 covers the grizzly bear => Mammal

Rule Coverage and Accuracy:
- Coverage of a rule: fraction of records that satisfy the antecedent of the rule.
- Accuracy of a rule: fraction of records covered by the rule that also satisfy its consequent.
Ex: For the rule (Status = Single) ==> No on the data set below, Coverage = 40%, Accuracy = 50%.

Tid | Refund | Marital Status | Taxable Income | Class
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

5.4.1 HOW DOES A RULE-BASED CLASSIFIER WORK?

A rule-based classifier classifies a test record based on the rule triggered by the record. To illustrate how a rule-based classifier works, consider the rule set given below:
R1: (Give Birth = no) AND (Can Fly = yes) ==> Birds
R2: (Give Birth = no) AND (Live in Water = yes) ==> Fishes
R3: (Give Birth = yes) AND (Blood Type = warm) ==> Mammals
R4: (Give Birth = no) AND (Can Fly = no) ==> Reptiles
R5: (Live in Water = sometimes) ==> Amphibians

Name          | Blood Type | Give Birth | Can Fly | Live in Water | Class
lemur         | warm       | yes        | no      | no            | ?
turtle        | cold       | no         | no      | sometimes     | ?
dogfish shark | cold       | yes        | no      | yes           | ?

- The lemur triggers rule R3, so it is classified as a mammal.
- The second vertebrate, the turtle, triggers rules R4 and R5. Since the classes predicted by the rules are contradictory (reptiles versus amphibians), their conflicting classes must be resolved.
- None of the rules are applicable to the dogfish shark. In this case, we need to ensure that the classifier can still make a reliable prediction even though a test record is not covered by any rule.

Characteristics of a Rule-Based Classifier:
(The above example of the lemur, turtle, and dogfish shark illustrates two important properties of a rule set, i.e.
mutually exclusive rules and exhaustive rules.)

Mutually exclusive rules:
- A classifier contains mutually exclusive rules if the rules are independent of each other, i.e., no record triggers more than one rule.
- Every record is covered by at most one rule.
- Example: the lemur and the dogfish shark are consistent with mutual exclusivity, since each triggers at most one rule, but the turtle violates it because it triggers two rules, R4 and R5.

Exhaustive rules:
- A classifier has exhaustive coverage if it accounts for every possible combination of attribute values.
- Every record is covered by at least one rule.
- Example: the lemur and the turtle are each covered by at least one rule, whereas the dogfish shark shows that this rule set is not exhaustive, since it triggers no rule at all.

Note: Together, these two properties ensure that every record is covered by exactly one rule.

Classification rules extracted from the decision tree in the figure (tests on Refund, Marital Status, and Taxable Income):
(Refund = Yes) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
(Refund = No, Marital Status = {Married}) ==> No

Here the rules are mutually exclusive and exhaustive.
Note: The rule set contains as much information as the tree.

Rules Can Be Simplified:
(The figure shows the same decision tree alongside the training data set given earlier.)
Initial rule: (Refund = No) AND (Status = Married) ==> No
Simplified rule: (Status = Married) ==> No

Effect of Rule Simplification:
If a rule set is not mutually exclusive, then
- a record may trigger more than one rule
- Solution?
  - use an ordered rule set, or
  - use an unordered rule set with a voting scheme

If a rule set is not exhaustive, then
- a record may not trigger any rule
- Solution? Use a default class: a default rule, Rd: () ==> yd, must be added to cover the remaining cases. A default rule has an empty antecedent and is triggered when all other rules have failed.

Ordered Rule Set:

In this approach, the rules in a rule set are ordered in decreasing order of their priority, which can be defined in many ways (e.g., based on accuracy, coverage, total description length, or the order in which the rules are generated). An ordered rule set is also known as a decision list.
Ex: Rules are ordered based on the way they are generated.
R1: (Give Birth = no) AND (Can Fly = yes) ==> Birds
R2: (Give Birth = no) AND (Live in Water = yes) ==> Fishes
R3: (Give Birth = yes) AND (Blood Type = warm) ==> Mammals
R4: (Give Birth = no) AND (Can Fly = no) ==> Reptiles
R5: (Live in Water = sometimes) ==> Amphibians

Name   | Blood Type | Give Birth | Can Fly | Live in Water | Class
turtle | cold       | no         | no      | sometimes     | ?

Here the turtle is assigned to the reptile class, because R4 is the highest-ranked rule it triggers.

Unordered Rules:

This approach allows a test record to trigger multiple classification rules and considers the consequent of each rule as a vote for a particular class. The votes are then tallied to determine the class label of the test record. The record is usually assigned to the class that receives the highest number of votes. In some cases, the vote may be weighted by the rule's accuracy.
Advantages:
o Unordered rules are less susceptible to errors caused by the wrong rule being selected to classify a test record.
o Model building is also less expensive because the rules do not have to be kept in sorted order.
Disadvantages:
o Classifying a test record can be quite expensive because the attributes of the test record must be compared against the precondition of every rule in the rule set.

5.4.2 RULE ORDERING SCHEMES:

Rule ordering can be implemented on a rule-by-rule basis or on a class-by-class basis.

Rule-based ordering scheme:
o Individual rules are ranked based on their quality.
o This ordering scheme ensures that every test record is classified by the "best" rule covering it.

Class-based ordering scheme:
o Rules that belong to the same class appear together in the rule set R.
o The rules are then collectively sorted on the basis of their class information.

Comparison between rule-based and class-based ordering schemes:

Rule-based Ordering:
(Refund = Yes) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
(Refund = No, Marital Status = {Married}) ==> No

Class-based Ordering:
(Refund = Yes) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
(Refund = No, Marital Status = {Married}) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes

5.4.3 HOW TO BUILD A RULE-BASED CLASSIFIER:

There are two broad classes of methods for extracting classification rules:
(1) Direct methods, which extract classification rules directly from data, and
(2) Indirect methods, which extract classification rules from other classification models, such as decision trees and neural networks.

5.4.3.1 DIRECT METHODS FOR RULE EXTRACTION:

The sequential covering algorithm is often used to extract
rules directly from data. The algorithm extracts the rules one class at a time for data sets that contain more than two classes. For the vertebrate classification problem, the sequential covering algorithm may generate rules for classifying birds first, followed by rules for classifying mammals, amphibians, reptiles, and finally, fishes.

The criterion for deciding which class should be generated first depends on a number of factors, such as the fraction of training records that belong to a particular class or the cost of misclassifying records from a given class.

Sequential Covering Algorithm:
1. Start from an empty rule.
2. Grow a rule using the Learn-One-Rule function.
3. Remove the training records covered by the rule.
4. Repeat Steps (2) and (3) until the stopping criterion is met.

During rule extraction, all training records that belong to class y are considered positive examples, while those that belong to other classes are considered negative examples. A rule is desirable if it covers most of the positive examples and none (or very few) of the negative examples. Once such a rule is found, the training records covered by the rule are eliminated. The new rule is added to the bottom of the decision list R. This procedure is repeated until the stopping criterion is met. The algorithm then proceeds to generate rules for the next class.

Figure 5.2 demonstrates how the sequential covering algorithm works for a data set that contains a collection of positive and negative examples. The rule R1, whose coverage is shown in Figure 5.2(b), is extracted first because it covers the largest fraction of positive examples. All the training records covered by R1 are subsequently removed, and the algorithm proceeds to look for the next best rule, which is R2.

Steps followed to build a rule, in detail:
1.
Learn-One-Rule Function:
o The objective of the Learn-One-Rule function is to extract a classification rule that covers many of the positive examples and none (or very few) of the negative examples in the training set. However, finding an optimal rule is computationally expensive given the exponential size of the search space.

2. Rule-Growing Strategy:
o There are two common strategies for growing a classification rule: general-to-specific and specific-to-general.
o Under the general-to-specific strategy, an initial rule r: {} ==> y is created, where the left-hand side is an empty set and the right-hand side contains the target class. This rule has poor quality because it covers all the examples in the training set.
o New conjuncts are subsequently added to improve the rule's quality.

General-to-specific rule-growing strategy for the vertebrate classification problem:
o The conjunct Body Temperature = warm-blooded is initially chosen to form the rule antecedent.
o The algorithm then explores all possible candidate conjuncts, such as Gives Birth = yes.
o This process continues until the stopping criterion is met.

Specific-to-general rule-growing strategy for the vertebrate classification problem:
o Suppose a positive example for mammals is chosen as the initial seed.
o To improve its coverage, the rule is generalized by removing a conjunct such as Hibernate = no.
o The refinement step is repeated until the stopping criterion is met, e.g., when the rule starts covering negative examples.

3. Rule Evaluation:
o An evaluation metric is needed to determine which conjunct should be added (or removed) during the rule-growing process.
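To see why plain accuracy can mislead for low-coverage rules, the sketch below compares it with the Laplace estimate, one common coverage-sensitive metric (using it here is this sketch's choice, not something the notes prescribe):

```python
# Rule quality from coverage counts: pos = positive examples covered,
# neg = negative examples covered. The Laplace estimate smooths plain
# accuracy toward the prior, penalizing rules with tiny coverage.
def accuracy(pos, neg):
    return pos / (pos + neg)

def laplace(pos, neg, k=2):          # k = number of classes
    return (pos + 1) / (pos + neg + k)

# R1 covers 50 positives and 5 negatives; R2 covers 2 positives only.
print(round(accuracy(50, 5), 3), round(laplace(50, 5), 3))  # 0.909 0.895
print(round(accuracy(2, 0), 3), round(laplace(2, 0), 3))    # 1.0 0.75
```

Under the Laplace estimate the broad rule (0.895) outranks the narrow one (0.75), matching the intuition that high accuracy on two records is potentially spurious.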
For example, consider a training set that contains 60 positive examples and 100 negative examples. Suppose we are given the following two candidate rules:
Rule R1: covers 50 positive examples and 5 negative examples.
Rule R2: covers 2 positive examples and no negative examples.
o The accuracies of R1 and R2 are 90.9% and 100%, respectively.
o However, R1 is the better rule despite its lower accuracy.
o The high accuracy of R2 is potentially spurious because the coverage of the rule is too low.

4. Rule Pruning:
o The rules generated by the Learn-One-Rule function can be pruned to improve their generalization error.
Ex: If the error on the validation set decreases after pruning, we should keep the simplified rule.

5. Rationale for Sequential Covering:
o After a rule is extracted, the sequential covering algorithm must eliminate all the positive and negative examples covered by the rule. The rationale for doing this is illustrated below:
o The figure shows three possible rules, R1, R2, and R3, extracted from a data set that contains 29 positive examples and 21 negative examples.
o The accuracies of R1, R2, and R3 are 12/15 (80%), 7/10 (70%), and 8/12 (66.7%), respectively.
o R1 is generated first because it has the highest accuracy.

5.4.3.2 INDIRECT METHOD FOR RULE EXTRACTION:

This is a method for generating a rule set from a decision tree.
In principle, every path from the root node to a leaf node of a decision tree can be expressed as a classification rule. For the decision tree in the figure (which splits on P, then on Q or R, and then on Q again), the corresponding rule set is:

r1: (P = No) ^ (Q = No) ==> -
r2: (P = No) ^ (Q = Yes) ==> +
r3: (P = Yes) ^ (R = No) ==> +
r4: (P = Yes) ^ (R = Yes) ^ (Q = No) ==> -
r5: (P = Yes) ^ (R = Yes) ^ (Q = Yes) ==> +

Example: consider the following three rules from the figure above:
r2: (P = No) ^ (Q = Yes) ==> +
r3: (P = Yes) ^ (R = No) ==> +
r5: (P = Yes) ^ (R = Yes) ^ (Q = Yes) ==> +
Observe that the rule set always predicts a positive class when the value of Q is Yes. Therefore, we may simplify the rules as follows:
r2': (Q = Yes) ==> +
r3: (P = Yes) ^ (R = No) ==> +
r3 is retained to cover the remaining instances of the positive class. Although the rules obtained after simplification are no longer mutually exclusive, they are less complex and easier to interpret.

Rule Generation:
o Classification rules are extracted for every path from the root to one of the leaf nodes in the decision tree.
o Given a classification rule r: A ==> y, we consider a simplified rule r': A' ==> y, where A' is obtained by removing one of the conjuncts in A.
Ex: r2: (P = No) ^ (Q = Yes) ==> + is simplified to r2': (Q = Yes) ==> +
o The simplified rule with the lowest pessimistic error rate is retained, provided its error rate is less than that of the original rule.
o The rule-pruning step is repeated until the pessimistic error of the rule cannot be improved further.
o Because some of the rules may become identical after pruning, the duplicate rules must be discarded.

Rule Ordering:
o After generating the rule set, the C4.5rules algorithm uses the class-based ordering scheme to order the extracted rules.
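The path-to-rule conversion described above can be sketched directly. The nested-dict tree encoding below is just one illustrative way to represent the P/Q/R tree from the figure; a real decision-tree package would use its own node structure.

```python
# Sketch: enumerate every root-to-leaf path of a decision tree as a rule.
# The nested-dict tree below encodes the P/Q/R example from the notes.
tree = {
    "attr": "P",
    "No":  {"attr": "Q", "No": "-", "Yes": "+"},
    "Yes": {"attr": "R",
            "No": "+",
            "Yes": {"attr": "Q", "No": "-", "Yes": "+"}},
}

def extract_rules(node, antecedent=()):
    """Depth-first walk; each leaf yields (conjuncts, class_label)."""
    if isinstance(node, str):          # leaf: node is the class label
        return [(antecedent, node)]
    rules = []
    for value in ("No", "Yes"):
        conjunct = (node["attr"], value)
        rules += extract_rules(node[value], antecedent + (conjunct,))
    return rules

for conjuncts, label in extract_rules(tree):
    lhs = " ^ ".join(f"({a} = {v})" for a, v in conjuncts)
    print(f"{lhs} ==> {label}")
# (P = No) ^ (Q = No) ==> -
# (P = No) ^ (Q = Yes) ==> +
# (P = Yes) ^ (R = No) ==> +
# (P = Yes) ^ (R = Yes) ^ (Q = No) ==> -
# (P = Yes) ^ (R = Yes) ^ (Q = Yes) ==> +
```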
o Rules that predict the same class are grouped together into the same subset.
o The total description length for each subset is computed, and the classes are arranged in increasing order of their total description length.
o The class that has the smallest description length is given the highest priority because it is expected to contain the best set of rules.

Advantages of Rule-Based Classifiers:
o As highly expressive as decision trees
o Easy to interpret
o Easy to generate
o Can classify new instances rapidly
o Performance comparable to decision trees

5.5 NEAREST-NEIGHBOR CLASSIFIERS:
The classification framework shown in the figure involves a two-step process:
1. an inductive step for constructing a classification model from the data, and
2. a deductive step for applying the model to test examples.
Decision tree and rule-based classifiers are examples of eager learners because they are designed to learn a model that maps the input attributes to the class label as soon as the training data becomes available.
An opposite strategy would be to delay the process of modeling the training data until it is needed to classify the test examples. Techniques that employ this strategy are known as lazy learners.
An example of a lazy learner is the Rote classifier, which memorizes the entire training data and performs classification only if the attributes of a test instance match one of the training examples exactly. An obvious drawback of this approach is that some test records may not be classified because they do not match any training example.
One way to make this approach more flexible is to find all the training examples that are relatively similar to the attributes of the test example. These examples, known as the nearest neighbors, can be used to determine the class label of the test example.
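The nearest-neighbor idea just described can be sketched in a few lines. The 2-D points and "+"/"-" labels below are invented for illustration; distance computation and majority voting are the same steps discussed in the next paragraphs.

```python
# Minimal k-nearest-neighbor sketch with majority voting (illustrative data).
from collections import Counter
import math

train = [  # (point in 2-D attribute space, class label)
    ((1.0, 1.0), "+"), ((1.2, 0.8), "+"),
    ((4.0, 4.0), "-"), ((4.2, 3.9), "-"), ((3.8, 4.1), "-"),
]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(test_point, train, k):
    # Sort training records by distance and vote among the k nearest.
    neighbors = sorted(train, key=lambda rec: euclidean(test_point, rec[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 1.1), train, k=1))  # +
print(knn_classify((3.0, 3.0), train, k=3))  # -
```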
The justification for using nearest neighbors is best exemplified by the following saying:
Basic idea: "If it walks like a duck, quacks like a duck, and looks like a duck, then it's probably a duck."
The classifier computes the distance from the test record to the training records and chooses the k "nearest" records.
A nearest-neighbor classifier represents each example as a data point in a d-dimensional space, where d is the number of attributes. Given a test example, we compute its proximity to the rest of the data points in the training set using one of the proximity measures.
Figure 5.7 illustrates the 1-, 2-, and 3-nearest neighbors of a data point located at the center of each circle. The data point is classified based on the class labels of its neighbors.
In the case where the neighbors have more than one label, the data point is assigned to the majority class of its nearest neighbors.
In Figure 5.7(a), the 1-nearest neighbor of the data point is a negative example. Therefore, the data point is assigned to the negative class.
If the number of nearest neighbors is three, as shown in Figure 5.7(c), then the neighborhood contains two positive examples and one negative example. Using the majority voting scheme, the data point is assigned to the positive class.
In the case where there is a tie between the classes (see Figure 5.7(b)), we may randomly choose one of them to classify the data point.
The preceding discussion underscores the importance of choosing the right value for k. If k is too small, then the nearest-neighbor classifier may be susceptible to overfitting because of noise in the training data.
On the other hand, if k is too large, the nearest-neighbor classifier may misclassify the test instance because its list of nearest neighbors may include data points that are located far away from its neighborhood (see Figure 5.8).

Characteristics of Nearest-Neighbor Classifiers:
o Nearest-neighbor classification is part of a more general technique known as instance-based learning, which uses specific training instances to make predictions without having to maintain an abstraction (or model) derived from the data.
o Lazy learners such as nearest-neighbor classifiers do not require model building. However, classifying a test example can be quite expensive because we need to compute the proximity values individually between the test example and the training examples.
o Nearest-neighbor classifiers make their predictions based on local information, whereas decision tree and rule-based classifiers attempt to find a global model that fits the entire input space.
o Nearest-neighbor classifiers can produce arbitrarily shaped decision boundaries. Such boundaries provide a more flexible model representation compared to decision tree and rule-based classifiers.
o Nearest-neighbor classifiers can produce wrong predictions unless the appropriate proximity measure and data preprocessing steps are taken. For example, suppose we want to classify a group of people based on attributes such as height (measured in meters) and weight (measured in pounds). The height attribute has low variability, ranging from 1.5 m to 1.85 m, whereas the weight attribute may vary from 90 lb to 250 lb. If the scale of the attributes is not taken into consideration, the proximity measure may be dominated by differences in the weights.
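The height/weight scaling problem can be seen numerically. The two sample people below are invented for illustration; the attribute ranges are the ones quoted in the notes, and min-max scaling is one common preprocessing choice (not the only one).

```python
# Illustration of the height/weight scaling problem: with raw units the
# Euclidean distance is dominated almost entirely by the weight attribute.
import math

a = (1.50, 90.0)   # (height in m, weight in lb) -- illustrative values
b = (1.85, 95.0)

raw = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
print(round(raw, 3))  # 5.012 -- the 0.35 m height gap barely registers

# Min-max scaling to [0, 1] using the ranges quoted in the notes
# (height 1.5-1.85 m, weight 90-250 lb) puts both attributes on equal footing.
def scale(point):
    h, w = point
    return ((h - 1.5) / (1.85 - 1.5), (w - 90.0) / (250.0 - 90.0))

sa, sb = scale(a), scale(b)
scaled = math.sqrt((sa[0] - sb[0]) ** 2 + (sa[1] - sb[1]) ** 2)
print(round(scaled, 3))  # 1.0 -- now the height difference dominates
```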
5.6 BAYESIAN CLASSIFIERS:
In many applications the relationship between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be predicted with certainty even though its attribute set is identical to some of the training examples. This situation may arise because of noisy data.
For example, consider the task of predicting whether a person is at risk for heart disease based on the person's diet and workout frequency. Although most people who eat healthily and exercise regularly have less chance of developing heart disease, they may still do so because of other factors such as heredity, excessive smoking, and alcohol abuse. Determining whether a person's diet is healthy or the workout frequency is sufficient is also subject to interpretation, which in turn may introduce uncertainties into the learning problem.
Bayes theorem is a statistical principle for combining prior knowledge of the classes with new evidence gathered from data.

5.6.1 Bayes Theorem:
Consider a football game between two rival teams: Team 0 and Team 1. Suppose Team 0 wins 65% of the time and Team 1 wins the remaining matches. Among the games won by Team 0, only 30% of them come from playing on Team 1's football field. On the other hand, 75% of the victories for Team 1 are obtained while playing at home. If Team 1 is to host the next match between the two teams, which team will most likely emerge as the winner?
This question can be answered by using the well-known Bayes theorem.
Formula of Bayes theorem:
P(Y | X) = P(X | Y) P(Y) / P(X)
The Bayes theorem can be used to solve the prediction problem stated at the beginning of this section.
Let X be the random variable that represents the team hosting the match, and let Y be the random variable that represents the winner of the match. Both X and Y can take on values from the set {0, 1}. We can summarize the information given in the problem as follows:
o Probability that Team 0 wins is P(Y = 0) = 0.65.
o Probability that Team 1 wins is P(Y = 1) = 1 - P(Y = 0) = 0.35.
o Probability that Team 1 hosted the match it won is P(X = 1 | Y = 1) = 0.75.
o Probability that Team 1 hosted a match won by Team 0 is P(X = 1 | Y = 0) = 0.3.
Our objective is to compute P(Y = 1 | X = 1), the conditional probability that Team 1 wins the next match it will be hosting, and to compare it against P(Y = 0 | X = 1).
Using the Bayes theorem, we obtain
P(Y = 1 | X = 1) = P(X = 1 | Y = 1) P(Y = 1) / [P(X = 1 | Y = 1) P(Y = 1) + P(X = 1 | Y = 0) P(Y = 0)]
                 = (0.75 x 0.35) / (0.75 x 0.35 + 0.3 x 0.65) = 0.5738
and therefore
P(Y = 0 | X = 1) = 1 - P(Y = 1 | X = 1) = 1 - 0.5738 = 0.4262.
Since P(Y = 1 | X = 1) > P(Y = 0 | X = 1), Team 1 has a better chance than Team 0 of winning the next match.

To estimate the class-conditional probabilities P(X | Y), two implementations of Bayesian classification methods are used:
1. Naive Bayes classifier
2. Bayesian belief network

1. Naive Bayes Classifier:
A naive Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent, given the class label.
Conditional independence:
An example of conditional independence is the relationship between a person's arm length and his or her reading skills. One might observe that people with longer arms tend to have higher levels of reading skills. This relationship can be explained by the presence of a confounding factor, which is age. A young child tends to have short arms and lacks the reading skills of an adult. If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears.
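The football calculation above is easy to verify directly:

```python
# Verifying the football example with Bayes' theorem:
# P(Y=1 | X=1) = P(X=1|Y=1) P(Y=1) / P(X=1), where P(X=1) is expanded by the
# law of total probability over the two possible winners.
p_y1 = 0.35           # Team 1 wins 35% of matches
p_y0 = 0.65           # Team 0 wins 65% of matches
p_x1_given_y1 = 0.75  # 75% of Team 1's wins are at home
p_x1_given_y0 = 0.30  # 30% of Team 0's wins are on Team 1's field

p_x1 = p_x1_given_y1 * p_y1 + p_x1_given_y0 * p_y0
p_y1_given_x1 = p_x1_given_y1 * p_y1 / p_x1

print(round(p_y1_given_x1, 4))      # 0.5738
print(round(1 - p_y1_given_x1, 4))  # 0.4262
```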
Thus, we can conclude that arm length and reading skills are conditionally independent when the age variable is fixed.

5.7 ESTIMATING PREDICTIVE ACCURACY OF CLASSIFICATION METHODS
The accuracy of a classification method is the ability of the method to correctly determine the class of a randomly selected data instance. It may be expressed as the probability of correctly classifying unseen data.
Estimating the accuracy of a supervised classification method can be difficult if only the training data is available and all of that data has been used in building the model. In such situations, overoptimistic predictions are often made regarding the accuracy of the model. The accuracy estimation problem is much easier when much more data is available than is required for training the model.
We would obviously like to obtain an accuracy estimate (e.g., a mean value and variance) that has low bias and low variance.
Accuracy may be measured using a number of metrics. These include:
o Sensitivity
o Specificity
o Precision
o Accuracy
The methods for estimating errors include:
o Holdout
o Random sub-sampling
o Cross-validation
o Leave-one-out
Let us assume that the test data has a total of T objects. When testing a method we find that C of the T objects are correctly classified. The error rate then may be defined as
Error rate = (T - C) / T
This error rate is expected to be a good estimate if the number of test objects T is large.
A method of estimating the error rate is considered biased if it either tends to underestimate the error or tends to overestimate it.
A "confusion matrix" is sometimes used to represent the result of testing.
Advantage of using a confusion matrix: it not only tells us how many objects got misclassified but also which misclassifications occurred.
For example, in Table 3.11 we can see that 1 out of the 10 objects that belonged to Class 3 got misclassified as Class 1.

Table 3.11: A confusion matrix for three classes

                  Predicted Class
                  1    2    3
True Class   1    8    2    0
             2    1    9    0
             3    1    2    7

Using Table 3.11 we can define the terms "false positive" (FP) and "false negative" (FN). False positive cases are those that did not belong to a class but were allocated to it; false negative cases are those that belonged to a class but were not allocated to it.

5.7.1 Methods for estimating the accuracy of a method:

Holdout Method:
o The holdout method requires a training set and a test set.
o The sets are mutually exclusive.
o A larger training set would produce a better classifier, while a larger test set would produce a better estimate of the accuracy. A balance must be achieved.

Random Sub-sampling Method:
o Random sub-sampling is very much like the holdout method except that it does not rely on a single test set.
o The holdout estimation is repeated several times and the accuracy estimate is obtained by computing the mean of the several trials.
o Random sub-sampling is likely to produce better error estimates than those produced by the holdout method.

K-fold Cross-validation Method:
o In K-fold cross-validation, the available data is randomly divided into k disjoint subsets of approximately equal size.
o One of the subsets is then used as the test set and the remaining k - 1 subsets are used for building the classifier.
o The test set is then used to estimate the accuracy.
o This is done repeatedly k times so that each subset is used as a test subset once.
o Cross-validation has been tested extensively and has been found to generally work well when sufficient data is available.

Leave-one-out Method:
o Leave-one-out is a simpler version of K-fold cross-validation.
o In this method:
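The error-rate formula and the per-class misclassification counts can be computed straight from the Table 3.11 confusion matrix:

```python
# Error rate and per-class misclassifications from the Table 3.11 confusion
# matrix (rows = true class, columns = predicted class).
confusion = [
    [8, 2, 0],   # true class 1
    [1, 9, 0],   # true class 2
    [1, 2, 7],   # true class 3
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(3))
error_rate = (total - correct) / total   # (T - C) / T

print(total, correct)        # 30 24
print(round(error_rate, 3))  # 0.2

# False positives for a class: predicted as that class but truly another.
# False negatives for a class: truly that class but predicted as another.
fp_class1 = sum(confusion[i][0] for i in range(3)) - confusion[0][0]
fn_class3 = sum(confusion[2]) - confusion[2][2]
print(fp_class1, fn_class3)  # 2 3
```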
One of the training samples is taken out and the model is generated using the remaining training data.
o Once the model is built, the one remaining sample is used for testing and the result is coded as 1 or 0 depending on whether it was classified correctly or not.
o The average of such results provides an estimate of the accuracy.
o The leave-one-out method is useful when the dataset is small.
o For large training datasets, leave-one-out can become expensive.

Bootstrap Method:
o In this method, given a dataset of size n, a bootstrap sample is randomly selected uniformly with replacement by sampling n times, and this sample is used to build a model.
o It can be shown that, on average, only 63.2% of the objects in such a sample are unique.
o The error in building the model is estimated by using the remaining 36.8% of objects that are not in the bootstrap sample.
o The final error is then computed as 0.368 times the training error plus 0.632 times the testing error, so that the optimistic training error receives the smaller weight.
o The figures 0.632 and 0.368 come from the fact that if n samples are randomly selected with replacement from n available objects, the expected fraction of unique objects in the bootstrap sample is about 0.632, or 63.2%, leaving about 0.368, or 36.8%, of the objects for the test data.
o This is repeated and the average of the error estimates is obtained.
o The bootstrap method, in contrast to leave-one-out, has low variance, but many iterations are needed for good error estimates if the sample is small.
o K-fold cross-validation, leave-one-out, and bootstrap techniques are used in many decision tree packages, including CART and Clementine.

5.8 IMPROVING ACCURACY OF CLASSIFICATION METHODS
Bootstrapping, bagging and boosting are techniques for improving the accuracy of classification results.
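The 0.632 figure follows from a short calculation: the chance that a given object is never picked in n draws with replacement is (1 - 1/n)^n, which tends to 1/e ≈ 0.368, so about 63.2% of objects are expected to appear in the sample. The example error values passed to the estimator at the end are made up for illustration.

```python
# The 0.632 figure: when n records are sampled n times with replacement, the
# expected fraction of unique records approaches 1 - 1/e ~= 0.632.
import math, random

n = 1000
analytic = 1 - (1 - 1 / n) ** n
print(round(analytic, 3))  # 0.632

random.seed(42)
sample = [random.randrange(n) for _ in range(n)]   # one bootstrap sample
print(round(len(set(sample)) / n, 2))  # close to 0.63 in practice

# The .632 bootstrap estimate weights the out-of-sample error by 0.632 and
# the optimistic training (resubstitution) error by 0.368.
def bootstrap_error(train_err, test_err):
    return 0.368 * train_err + 0.632 * test_err

print(round(bootstrap_error(0.05, 0.20), 4))  # 0.1448
```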
They have been shown to be very successful for certain models, for example, decision trees. All three involve combining several classification results obtained from the same training data.
The aim of building several decision trees from different samples of the training data is to find out how these trees differ from one another.
Bootstrapping ensures that the samples selected are somewhat different from each other since, on average, only 63.2% of the objects in each sample are unique. The bootstrap samples are then used for building decision trees, which are then combined into a single classifier.
Bagging (the name is derived from Bootstrap AGGregatING) combines classification results from multiple models, or the results of using the same method on several different sets of training data. Bagging may also be used to improve the stability and accuracy of a complex classification model with limited training data, by using sub-samples obtained by resampling, with replacement, for generating the models. In contrast to the simple voting used in bagging, weighted voting can also be used.
A technique called boosting involves applying weights to different instances of the training data depending on the quality of the previous result. The idea is to give more weight to those instances that are not being correctly classified and less weight to those that are frequently classified correctly. One approach is to apply different weights to data in different classes: for example, the weights could be inversely proportional to the accuracy of prediction in each class. Thus classes that are not being classified with a high level of accuracy will get a higher weight, and the data from these classes will be selected more frequently in resampling.
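The bagging procedure described above can be sketched end to end. Everything here is illustrative: the base learner is a deliberately trivial 1-NN on a single numeric feature rather than a decision tree, and the tiny labelled dataset is invented for the example.

```python
# Minimal bagging sketch: draw bootstrap samples, fit a weak base classifier
# on each, and combine the individual predictions by simple majority vote.
import random
from collections import Counter

def fit_1nn(sample):
    """Return a classifier that predicts the label of the nearest sampled x."""
    def predict(x):
        return min(sample, key=lambda rec: abs(rec[0] - x))[1]
    return predict

def bagging(train, n_models, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # bootstrap sample: n draws with replacement from the training data
        sample = [rng.choice(train) for _ in train]
        models.append(fit_1nn(sample))
    def ensemble(x):  # majority vote over the individual models
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return ensemble

train = [(0.1, "A"), (0.2, "A"), (0.3, "A"), (0.8, "B"), (0.9, "B"), (1.0, "B")]
clf = bagging(train, n_models=11)
print(clf(0.15), clf(0.95))
```

Boosting differs in that each round reweights (or resamples) the training data toward the instances the previous models got wrong, rather than sampling uniformly.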
When the classifier is used again, the result can improve the performance on the classes that were doing poorly in the previous result. This is followed by another iteration of computing weights, and so on, until some stopping criterion is met.
Benefits of these methods:
o These techniques can provide a level of accuracy that usually cannot be obtained by a large single-tree model.
o Creating a single decision tree from a collection of trees in bagging and boosting is not difficult.
o These methods can often help in avoiding the problem of overfitting, since a number of trees based on random samples are used.
o Boosting appears to be, on average, better than bagging, although this is not always so; on some problems bagging does better than boosting.
o On the other hand, it is more difficult to visualize a large set of decision trees than a single tree, and bagging and boosting generally produce a final classifier that is much more difficult to understand.
Nevertheless, bagging and boosting generally do improve accuracy and are used in several decision tree software packages.

5.9 OTHER EVALUATION CRITERIA FOR CLASSIFICATION METHODS
Other criteria for the evaluation of classification methods, like those for all data mining techniques, are:
1. Speed
2. Robustness
3. Scalability
4. Interpretability
5. Goodness of the model
6. Flexibility
7. Time complexity

Speed:
o Speed involves not just the time or computation cost of constructing a model (e.g., a decision tree); it also includes the time required to learn to use the model.
o Obviously, a user wishes to minimize both times, although it has to be understood that any significant data mining project will take time to plan and prepare the data.
Robustness:
o Data errors are common, in particular when data is being collected from a number of sources, and errors may remain even after data cleaning.
o It is therefore desirable that a method be able to produce good results in spite of some errors and missing values in datasets.

Scalability:
o Many data mining methods were originally designed for small datasets, and many have been modified to deal with large problems.
o Given that large datasets are becoming common, it is desirable that a method continues to work efficiently for large disk-resident databases as well.

Interpretability:
o An important task of a data mining professional is to ensure that the results of data mining are explained to the decision makers.
o It is therefore desirable that the end-user be able to understand and gain insight from the results produced by the classification method.

Goodness of the Model:
o For a model to be effective, it needs to fit the problem that is being solved. For example, in decision tree classification, it is desirable to find a decision tree of the right size and compactness with high accuracy.

Possible Questions from This Chapter:
1. Define classification and explain the purpose of a classification model. Ans: Page 1
2. Explain the general approach followed for building classification problems. Ans: Page 2
3. Explain how a decision tree works. (5 marks) Ans: Pages 3, 4 & 5
4. Explain how to build a decision tree (Hunt's algorithm). (5 marks) Ans: Pages 5, 6 & 7
5. Explain the different methods used to express the attribute test condition using figures, and also explain the measures taken for selecting the best split. (5+5 marks) Ans: Pages 7, 8, 9 & 10
6.
Explain the algorithm for decision tree induction. (5 marks) Ans: Pages 10-11
7. Write a note on web robot detection. (5 marks) Ans: Pages 11 & 12
8. Explain the different characteristics of decision tree induction. (5 marks) Ans: Pages 12 & 13
9. Explain rule-based classifiers and how a rule-based classifier works, in detail. (10 marks) Ans: Pages 13, 14, 15 & 16
10. Explain the characteristics of rule-based classifiers. (5 marks) Ans: Page 16
11. Explain the effect of rule simplification. (5 marks) Ans: Pages 17 & 18
12. Explain rule ordering schemes in rule-based classifiers. (5 marks) Ans: Page 19
13. Explain in detail the building of a rule-based classifier using the direct method. (10 marks) Ans: Pages 20 to 24
14. Explain in detail the building of a rule-based classifier using the indirect method, and also give the advantages of using rule-based classifiers. (10 marks) Ans: Pages 24 to 26
15. Explain the nearest-neighbour classifier in detail, and also give the characteristics of nearest-neighbour classifiers. (10 marks) Ans: Pages 27 to 30
16. Explain Bayesian classifiers in detail. (10 marks) Ans: Pages 30 to 32
17. Explain the methods used for estimating accuracy. (5 marks) Ans: Pages 34 & 35
18. Explain the evaluation criteria for classification methods. (5 marks) Ans: Pages 37 & 38