CENG 464 Introduction to Data Mining

Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data are classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes: Y = f(X)
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.

Prediction Problems: Classification vs. Numeric Prediction
• Classification
  – predicts categorical class labels (discrete or nominal)
  – constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Numeric prediction
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
  – Credit/loan approval
  – Medical diagnosis: is a tumor cancerous or benign?
  – Fraud detection: is a transaction fraudulent?
  – Web page categorization: which category does a page belong to?

Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model
    • The known label of each test sample is compared with the model's prediction
    • Accuracy rate is the percentage of test set samples that are correctly classified by the model
    • The test set is independent of the training set (otherwise overfitting)
  – If the accuracy is acceptable, use the model to classify new data
• Note: if the test set is used to select among models, it is called a validation (test) set

Process (1): Model Construction
Training data:

| NAME | RANK           | YEARS | TENURED |
|------|----------------|-------|---------|
| Mike | Assistant Prof | 3     | no      |
| Mary | Assistant Prof | 7     | yes     |
| Bill | Professor      | 2     | yes     |
| Jim  | Associate Prof | 7     | yes     |
| Dave | Assistant Prof | 6     | no      |
| Anne | Associate Prof | 3     | no      |

The learning algorithm produces the classifier (model): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Process (2): Using the Model in Prediction
• Training and test sets are randomly sampled from the given data
• The classifier is applied to the test data, and then to unseen data, e.g., the record (Jeff, Professor, 4): Tenured?

Test data:

| NAME    | RANK           | YEARS | TENURED |
|---------|----------------|-------|---------|
| Tom     | Assistant Prof | 2     | no      |
| Merlisa | Associate Prof | 7     | no      |
| George  | Professor      | 5     | yes     |
| Joseph  | Assistant Prof | 7     | yes     |

Illustrating Classification Task
Training set (Tid 1–10):

| Tid | Attrib1 | Attrib2 | Attrib3 | Class |
|-----|---------|---------|---------|-------|
| 1   | Yes     | Large   | 125K    | No    |
| 2   | No      | Medium  | 100K    | No    |
| 3   | No      | Small   | 70K     | No    |
| 4   | Yes     | Medium  | 120K    | No    |
| 5   | No      | Large   | 95K     | Yes   |
| 6   | No      | Medium  | 60K     | No    |
| 7   | Yes     | Large   | 220K    | No    |
| 8   | No      | Small   | 85K     | Yes   |
| 9   | No      | Medium  | 75K     | No    |
| 10  | No      | Small   | 90K     | Yes   |

The learning algorithm performs induction on the training set to learn a model; the model is then applied (deduction) to the test set (Tid 11–15):

| Tid | Attrib1 | Attrib2 | Attrib3 | Class |
|-----|---------|---------|---------|-------|
| 11  | No      | Small   | 55K     | ?     |
| 12  | Yes     | Medium  | 80K     | ?     |
| 13  | Yes     | Large   | 110K    | ?     |
| 14  | No      | Small   | 95K     | ?     |
| 15  | No      | Large   | 67K     | ?     |

The goal is to find a mapping or function that can predict the class label of a given tuple X.
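To make the two-step process concrete, the following minimal Python sketch (not part of the original lecture material) encodes the tenure rule learned above and estimates its accuracy on the independent test table; the data and the rule come from the slides, everything else is illustrative.

```python
# Minimal sketch of the two-step classification process, using the
# tenure example from the slides.

test = [  # (name, rank, years, tenured) -- the test table above
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def model(rank: str, years: int) -> str:
    # Step 1 produced this rule: IF rank = 'professor' OR years > 6
    # THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2: estimate accuracy on the independent test set.
correct = sum(model(rank, years) == label for _, rank, years, label in test)
print(f"accuracy = {correct}/{len(test)} = {correct/len(test):.2f}")  # 3/4 = 0.75
```

Note that the rule misclassifies Merlisa, which is exactly why accuracy must be measured on a test set rather than assumed.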
Classification Techniques
• Decision tree based methods
• Bayes classification methods
• Rule-based methods
• Nearest-neighbor classifiers
• Artificial neural networks
• Support vector machines
• Memory based reasoning

Example of a Decision Tree
Training data:

| Tid | Refund | Marital Status | Taxable Income | Cheat |
|-----|--------|----------------|----------------|-------|
| 1   | Yes    | Single         | 125K           | No    |
| 2   | No     | Married        | 100K           | No    |
| 3   | No     | Single         | 70K            | No    |
| 4   | Yes    | Married        | 120K           | No    |
| 5   | No     | Divorced       | 95K            | Yes   |
| 6   | No     | Married        | 60K            | No    |
| 7   | Yes    | Divorced       | 220K           | No    |
| 8   | No     | Single         | 85K            | Yes   |
| 9   | No     | Married        | 75K            | No    |
| 10  | No     | Single         | 90K            | Yes   |

Model (decision tree):
• The root and internal nodes hold splitting attributes, arrows are attribute test conditions, and leaf nodes hold class labels
• Refund = Yes → NO
• Refund = No → test MarSt
  – MarSt = Married → NO
  – MarSt = Single or Divorced → test TaxInc
    • TaxInc < 80K → NO
    • TaxInc > 80K → YES

Another Example of a Decision Tree
• The same data can also be fit by a tree that splits on MarSt first: Married → NO; Single or Divorced → test Refund (Yes → NO; No → test TaxInc: < 80K → NO, > 80K → YES)
• There can be more than one tree that fits the same data!
• The performance/speed of classification depends on the tree structure (its number of levels)
• The accuracy of the tree must still be evaluated

Decision Tree Classification Task
• A tree induction algorithm learns a decision tree model from the training set (induction); the model is then applied to the test set (deduction)

Apply Model to Test Data
• Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
• Start from the root of the tree: Refund = No → follow the MarSt branch
• MarSt = Married → reach the leaf labeled NO
• Assign Cheat = "No"
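The traversal above can be made concrete with a small sketch. The following Python snippet (illustrative only, not the lecture's code) encodes the first tree as a nested structure and classifies the test record:

```python
# Minimal sketch: the Refund/MarSt/TaxInc tree encoded as nested tuples,
# applied to the test record Refund=No, MarSt=Married, TaxInc=80K.

tree = ("Refund", {
    "Yes": "No",
    "No": ("MarSt", {
        "Married": "No",
        "Single": ("TaxInc<80K", {True: "No", False: "Yes"}),
        "Divorced": ("TaxInc<80K", {True: "No", False: "Yes"}),
    }),
})

def classify(node, record):
    if isinstance(node, str):          # leaf: class label
        return node
    attr, branches = node
    if attr == "TaxInc<80K":           # continuous attribute: binary test
        return classify(branches[record["TaxInc"] < 80], record)
    return classify(branches[record[attr]], record)

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(tree, record))  # -> "No" (assign Cheat = "No")
```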
Decision Tree Induction
• Many algorithms:
  – Hunt's Algorithm
  – ID3, C4.5
  – CART
  – SLIQ, SPRINT

Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting

Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (continuous-valued attributes are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain), the attribute selection measure
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning; majority voting is employed for labeling the leaf
  – There are no samples left

How to Specify the Test Condition?
• Depends on the attribute type: nominal, ordinal, or continuous
• Depends on the number of ways to split: 2-way split or multi-way split

Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g., CarType ∈ {Family, Sports, Luxury}
• Binary split: divide the values into two subsets, e.g., {Sports, Luxury} vs. {Family} or {Family, Luxury} vs. {Sports}; the optimal partitioning must be found

Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values, e.g., Size ∈ {Small, Medium, Large}
• Binary split: divide the values into two subsets that respect the order, e.g., {Small, Medium} vs. {Large} or {Small} vs. {Medium, Large}

Splitting Based on Continuous Attributes
• Discretization to form an ordinal categorical attribute
  – Static: discretize once at the beginning
  – Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
• Binary split: a binary decision (A < v) or (A ≥ v), e.g., "Taxable Income > 80K?"
  – consider all possible splits and find the best cut
  – can be more compute-intensive
• Multi-way split, e.g., Taxable Income ∈ {< 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K}

How to Determine the Best Split?
• Before splitting: 10 records of class 0 and 10 records of class 1
• Candidate splits: Own Car? (C0:6/C1:4 vs. C0:4/C1:6), Car Type? (Family C0:1/C1:3, Sports C0:8/C1:0, Luxury C0:1/C1:7), Student ID? (20 pure single-record partitions c1..c20)
• Which test condition is the best?
• Greedy approach: nodes with a homogeneous class distribution are preferred
• We need a measure of node impurity:
  – C0:9, C1:1 — homogeneous, low degree of impurity
  – C0:5, C1:5 — non-homogeneous, high degree of impurity

Why do we need an attribute selection measure?
• Why not choose attributes randomly? Random attribute selection ignores how well a split separates the classes, so a measure that ranks candidate splits is needed (a sketch of the resulting greedy induction loop follows).
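Before turning to the specific measures, here is a minimal Python skeleton of the greedy, top-down induction procedure described above, with a pluggable impurity measure. It is an illustrative sketch under the slide's assumptions (categorical attributes, majority voting at exhausted leaves), not the lecture's own code.

```python
# Illustrative skeleton of greedy top-down decision tree induction.
# rows are tuples whose last element is the class label; attrs are
# attribute indices; impurity(rows) -> float is plugged in by the caller.

from collections import Counter

def majority_class(rows):
    return Counter(row[-1] for row in rows).most_common(1)[0][0]

def build_tree(rows, attrs, impurity):
    labels = {row[-1] for row in rows}
    if len(labels) == 1:              # stop: all samples in one class
        return labels.pop()
    if not attrs:                     # stop: no attributes left -> majority vote
        return majority_class(rows)

    def partitions(a):
        parts = {}
        for row in rows:
            parts.setdefault(row[a], []).append(row)
        return parts

    def weighted_impurity(a):
        return sum(len(p) / len(rows) * impurity(p)
                   for p in partitions(a).values())

    best = min(attrs, key=weighted_impurity)   # greedy attribute selection
    rest = [a for a in attrs if a != best]
    return (best, {v: build_tree(part, rest, impurity)
                   for v, part in partitions(best).items()})

# usage with misclassification error as the impurity measure:
error = lambda rows: 1 - Counter(r[-1] for r in rows).most_common(1)[0][1] / len(rows)
data = [("youth", "no"), ("youth", "no"), ("middle-aged", "yes"), ("senior", "yes")]
print(build_tree(data, [0], error))
```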
Attribute Selection: Splitting Rules Measures (Measures of Node Impurity)
• An attribute selection measure provides a ranking for each attribute describing the given training tuples; the attribute with the best score for the measure is chosen as the splitting attribute for the given tuples.
• Information gain (entropy)
• Gini index
• Misclassification error

Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We use it to decide the ordering of attributes in the nodes of a decision tree.

Brief Review of Entropy
• For m = 2 classes, the entropy ranges from 0 (a pure node) to 1 (a uniform 50/50 class distribution).

Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or impurity in these partitions.
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
• Expected information (entropy) needed to classify a tuple in D:

  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

  Info(D) is the average amount of information needed to identify the class label of a tuple in D.
• Information needed (after using attribute A to split D into v partitions) to classify D:

  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)

  |D_j|/|D| acts as the weight of the j-th partition; the smaller the expected information Info_A(D), the greater the purity of the partitions.
• Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D)

  This tells us how much would be gained by branching on A. Ideally, we want the partitioning on attribute A to produce an exact classification of the tuples, so we partition on the attribute A that does the best classification, leaving the least information still required to finish classifying the tuples.

DT Algorithm: Example
Which attribute should be the topmost node of our decision tree?
• Determine the information gain for each candidate attribute:
  – Gain(S, Outlook) = 0.246
  – Gain(S, Humidity) = 0.151
  – Gain(S, Wind) = 0.048
  – Gain(S, Temperature) = 0.029
• So Outlook becomes the top node.

Example: Which Attribute to Select?
Training data (the buys_computer data set):

| age         | income | student | credit_rating | buys_computer |
|-------------|--------|---------|---------------|---------------|
| youth       | high   | no      | fair          | no            |
| youth       | high   | no      | excellent     | no            |
| middle-aged | high   | no      | fair          | yes           |
| senior      | medium | no      | fair          | yes           |
| senior      | low    | yes     | fair          | yes           |
| senior      | low    | yes     | excellent     | no            |
| middle-aged | low    | yes     | excellent     | yes           |
| youth       | medium | no      | fair          | no            |
| youth       | low    | yes     | fair          | yes           |
| senior      | medium | yes     | fair          | yes           |
| youth       | medium | yes     | excellent     | yes           |
| middle-aged | medium | no      | excellent     | yes           |
| middle-aged | high   | yes     | fair          | yes           |
| senior      | medium | no      | excellent     | no            |

Attribute Selection: Information Gain
• Class P: buys_computer = "yes" (9 tuples); class N: buys_computer = "no" (5 tuples)

  Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

• For age:

| age         | p_i | n_i | I(p_i, n_i) |
|-------------|-----|-----|-------------|
| youth       | 2   | 3   | 0.971       |
| middle-aged | 4   | 0   | 0           |
| senior      | 3   | 2   | 0.971       |

  Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

  Here \frac{5}{14} I(2,3) means "youth" has 5 of the 14 samples, with 2 yes's and 3 no's. Hence

  Gain(age) = Info(D) - Info_age(D) = 0.246

• Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.
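The worked example can be verified with a few lines of Python. This is an illustrative sketch (not lecture code) that reproduces Info(D) = 0.940, Info_age(D) = 0.694, and Gain(age) = 0.246 from the table above:

```python
# Computing entropy and information gain for the buys_computer data.

from collections import Counter
from math import log2

# (age, buys_computer) pairs from the table above
data = [("youth", "no"), ("youth", "no"), ("middle-aged", "yes"),
        ("senior", "yes"), ("senior", "yes"), ("senior", "no"),
        ("middle-aged", "yes"), ("youth", "no"), ("youth", "yes"),
        ("senior", "yes"), ("youth", "yes"), ("middle-aged", "yes"),
        ("middle-aged", "yes"), ("senior", "no")]

def info(labels):
    """Expected information Info(D) = -sum_i p_i log2 p_i."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

labels = [y for _, y in data]
print(f"Info(D)     = {info(labels):.3f}")               # 0.940

partitions = {}
for age, y in data:
    partitions.setdefault(age, []).append(y)

info_age = sum(len(p) / len(data) * info(p) for p in partitions.values())
print(f"Info_age(D) = {info_age:.3f}")                   # 0.694
print(f"Gain(age)   = {info(labels) - info_age:.3f}")    # 0.246
```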
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the Gini index gini(D) is defined as

  gini(D) = 1 - \sum_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in D.
• If a data set D is split on A into two subsets D_1 and D_2, the Gini index gini_A(D) is defined as

  gini_A(D) = \frac{|D_1|}{|D|} \, gini(D_1) + \frac{|D_2|}{|D|} \, gini(D_2)

• Reduction in impurity:

  \Delta gini(A) = gini(D) - gini_A(D)

• The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node; this requires enumerating all possible splitting points for each attribute.

Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):

  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

• Ex.: gain_ratio(income) = 0.029 / 1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute

Computation of Gini Index
• Ex.: D has 9 tuples with buys_computer = "yes" and 5 with "no":

  gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

• Suppose the attribute income partitions D into D_1 = {low, medium} with 10 tuples and D_2 = {high} with 4 tuples:

  gini_{income \in \{low,\,medium\}}(D) = \frac{10}{14} \, gini(D_1) + \frac{4}{14} \, gini(D_2) = 0.443

  Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450, so we split on {low, medium} (and {high}) since it has the lowest Gini index.
• With this measure all attributes are assumed continuous-valued; other tools, e.g., clustering, may be needed to get the possible split values, and the measure can be modified for categorical attributes.
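Again the numbers can be checked directly. The following illustrative Python sketch (not lecture code) reproduces gini(D) = 0.459 and the winning {low, medium}/{high} split on income:

```python
# Gini index computation for the buys_computer income split.

from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# (income, buys_computer) pairs from the table in the previous section
data = [("high", "no"), ("high", "no"), ("high", "yes"), ("medium", "yes"),
        ("low", "yes"), ("low", "no"), ("low", "yes"), ("medium", "no"),
        ("low", "yes"), ("medium", "yes"), ("medium", "yes"), ("medium", "yes"),
        ("high", "yes"), ("medium", "no")]

labels = [y for _, y in data]
print(f"gini(D) = {gini(labels):.3f}")                   # 0.459

d1 = [y for inc, y in data if inc in ("low", "medium")]  # 10 tuples
d2 = [y for inc, y in data if inc == "high"]             # 4 tuples
g = len(d1) / len(data) * gini(d1) + len(d2) / len(data) * gini(d2)
print(f"gini_income{{low,medium}}(D) = {g:.3f}")         # 0.443
```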
Comparing Attribute Selection Measures
• The three measures, in general, return good results, but:
  – Information gain: biased towards multivalued attributes
  – Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  – Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions

Overfitting and Tree Pruning
• Overfitting: an induced tree may overfit the training data. A classification model built from a limited set of training examples is too representative of the training examples themselves rather than of the general characteristics of the data:
  – Too many branches, some of which may reflect anomalies due to noise or outliers
  – Poor accuracy for unseen samples
  – Remedy: remove the least reliable branches from the tree
• Two approaches to avoid overfitting:
  – Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • It is difficult to choose an appropriate threshold
  – Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the "best pruned tree"

Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm; its measure is based on the χ² test for independence
• C-SEP: performs better than information gain and Gini index in certain cases
• G-statistic: has a close approximation to the χ² distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
• Multivariate splits (partitions based on combinations of multiple variables); CART, for example, finds multivariate splits based on a linear combination of attributes
• Which attribute selection measure is the best? Most give good results; none is significantly superior to the others.
Performance of DT
• The natural performance measure for classification problems is the error rate on a test set:
  – Success: an instance's class is predicted correctly
  – Error: an instance's class is predicted incorrectly
  – Error rate: the proportion of errors made over the whole set of instances (the ratio of the number of misclassifications to the total number of classifications)
  – Accuracy: the proportion of correctly classified instances over the whole set of instances; accuracy = 1 - error rate
• The results are summarized in a confusion matrix (detailed later)

Classification Cost
• Associate a cost with misclassification. Example: class + = with brain tumor, class − = without brain tumor
  – A false negative (FN) error can be highly costly: a patient with a brain tumor is missed and may die
  – A false positive (FP) error is less serious: a healthy patient is sent to extra screening unnecessarily
  – A classifier biased toward the + class is preferred here

Accuracy in DT
• Methods for evaluating the performance of a classification model (explored later):
  – Bootstrap
  – Cross-validation

Decision Tree Based Classification
• Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy comparable to other classification techniques for many simple data sets
• Disadvantages:
  – Irrelevant attributes may badly affect the construction of a decision tree (e.g., ID numbers)
  – Small variations in the data can yield very different-looking trees
  – A sub-tree can be replicated several times
  – Error-prone with too many classes
  – Not good for predicting the value of a continuous class attribute

Chapter 8. Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary

Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: based on Bayes' theorem
• Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
• Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
• Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayes' Theorem: Basics
• Total probability theorem:

  P(B) = \sum_{i=1}^{M} P(B \mid A_i) \, P(A_i)

• Bayes' theorem:

  P(H \mid X) = \frac{P(X \mid H) \, P(H)}{P(X)}

  – Let X be a data sample ("evidence") whose class label is unknown
  – Let H be the hypothesis that X belongs to class C
  – Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X
  – P(H) (prior probability): the initial probability, e.g., that X will buy a computer, regardless of age, income, ...
  – P(X): the probability that the sample data is observed
  – P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Prediction Based on Bayes' Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H \mid X) = \frac{P(X \mid H) \, P(H)}{P(X)}

• Informally: posterior = likelihood × prior / evidence
• Predict that X belongs to C_i iff the probability P(C_i|X) is the highest among all P(C_k|X) over the k classes
• Practical difficulty: this requires initial knowledge of many probabilities, involving significant computational cost

Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x_1, x_2, ..., x_n)
• Suppose there are m classes C_1, C_2, ..., C_m
• Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X)
• This can be derived from Bayes' theorem:

  P(C_i \mid X) = \frac{P(X \mid C_i) \, P(C_i)}{P(X)}

• Since P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized

Naïve Bayes Classifier
• A simplifying assumption: attributes are conditionally independent given the class (i.e., there are no dependence relations between attributes):

  P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)

• This greatly reduces the computation cost: only the class distribution needs to be counted
• If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_{i,D}| (the number of tuples of C_i in D)
• If A_k is continuous-valued, P(x_k|C_i) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

  g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})

Naïve Bayes Classifier: Training Dataset
• Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
• Training data: the buys_computer table shown earlier
• Data to be classified: X = (age = youth, income = medium, student = yes, credit_rating = fair)

Naïve Bayes Classifier: An Example
• P(C_i):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357
• Compute P(X|C_i) for each class:
  P(age = "youth" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "youth" | buys_computer = "no") = 3/5 = 0.600
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.400
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.200
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.400
• For X = (age = youth, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• P(X|C_i) × P(C_i):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
• Therefore, X belongs to the class buys_computer = "yes"

Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability

  P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)

  will be zero
• Ex.: suppose a dataset with 1000 tuples where income = low has 0 tuples, income = medium has 990, and income = high has 10
• Use the Laplacian correction (Laplacian estimator): add 1 to each case:
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
• The "corrected" probability estimates are close to their "uncorrected" counterparts

Naïve Bayes Classifier: Comments
• Advantages
  – Easy to implement
  – Robust to noise
  – Can handle null values
  – Good results obtained in most cases
• Disadvantages
  – The assumption of class conditional independence causes a loss of accuracy
  – In practice, dependencies exist among variables; e.g., in hospital data a patient's profile (age, family history), symptoms (fever, cough), and disease (lung cancer, diabetes) are interdependent, and such dependencies cannot be modeled by a naïve Bayes classifier
• How to deal with these dependencies? Bayesian belief networks
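Tying the worked example together, the following illustrative Python sketch (not lecture code) counts the class-conditional frequencies and reproduces the scores 0.028 vs. 0.007:

```python
# Naïve Bayes on the buys_computer data, without Laplace smoothing.

from collections import Counter, defaultdict

# rows: (age, income, student, credit_rating, buys_computer)
data = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle-aged", "medium", "no", "excellent", "yes"),
    ("middle-aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]

prior = Counter(row[-1] for row in data)        # class counts
cond = defaultdict(Counter)                     # per-class (attr, value) counts
for row in data:
    for k, v in enumerate(row[:-1]):
        cond[row[-1]][(k, v)] += 1

def score(x, c):
    """P(x|c) * P(c) under the conditional-independence assumption."""
    p = prior[c] / len(data)
    for k, v in enumerate(x):
        p *= cond[c][(k, v)] / prior[c]
    return p

x = ("youth", "medium", "yes", "fair")
for c in prior:
    print(c, round(score(x, c), 3))             # yes: 0.028, no: 0.007
print("predicted:", max(prior, key=lambda c: score(x, c)))
```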
Chapter 8. Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Nearest Neighbour Approach
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary

Rule-Based Classification
• Construct a sequence of rules from the training set
• Search the rules to classify an unseen record: the rule that covers the record is fired to determine its class
• Rules can be discovered directly or indirectly
• A rule-based classifier uses a set of IF-THEN rules for classification
• An IF-THEN rule is an expression of the form IF condition THEN conclusion, where
  – the condition (or LHS) is the rule antecedent/precondition
  – the conclusion (or RHS) is the rule consequent
• Ex.: IF age = youth AND student = yes THEN buys_computer = yes
• The condition consists of one or more attribute tests that are logically ANDed
• The rule's consequent contains a class prediction: here we are predicting whether a customer will buy a computer

Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules:
  R: IF age = youth AND student = yes THEN buys_computer = yes
  – If a rule is satisfied by X, it covers the tuple; the rule is said to be triggered/fired, returning its class prediction
  – Ideally, rules should be mutually exclusive and exhaustive
    • Mutually exclusive: no two rules are triggered for the same tuple
    • Exhaustive: there is one rule for each possible attribute-value combination, so no default rule is needed

Rule-Based Classification: Assessment of a Rule
• Coverage of a rule: the percentage of instances that satisfy the antecedent of the rule (i.e., whose attribute values hold true for the rule's antecedent)
• Accuracy of a rule: the percentage of instances that satisfy both the antecedent and the consequent of the rule
• With n_covers = number of tuples covered by R and n_correct = number of tuples correctly classified by R:

  coverage(R) = n_covers / |D|  (D: the training data set)
  accuracy(R) = n_correct / n_covers

Using IF-THEN Rules for Classification
Two ways of executing a rule set:
• Ordered set of rules ("decision list"): order is important for the interpretation of the rules
• Unordered set of rules: rules may overlap and lead to different conclusions for the same instance; all rules may be fired at the same time

If more than one rule is triggered, a conflict resolution strategy is needed:
• Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirements (i.e., highest quality, with the most attribute tests)
• Rule ordering: prioritize the rules beforehand
  – Class-based ordering: classes are sorted in order of decreasing importance, such as order of prevalence or misclassification cost per class; within each class, the rules are not ordered
  – Rule-based ordering (decision list): rules are organized into one long priority list according to some measure of rule quality such as accuracy, coverage, or size. The first rule satisfying X fires its class prediction; any other rule satisfying X is ignored. Each rule in the list implies the negation of the rules that come before it, which makes the list difficult to interpret.
• What if no rule is fired for X? Use a default rule (see the sketch below)!
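The following illustrative Python sketch (not lecture code) shows a decision list with a default rule, together with the coverage/accuracy measures defined above; the example rules are taken from the buys_computer tree discussed in the next section.

```python
# A decision list: the first matching rule wins; a default rule fires
# when no other rule covers the record.

rules = [
    # (antecedent as {attribute: value}, predicted class)
    ({"age": "youth", "student": "yes"}, "yes"),
    ({"age": "middle-aged"}, "yes"),
    ({"age": "senior", "credit_rating": "excellent"}, "no"),
]
DEFAULT = "no"  # evaluated only when no other rule covers X

def covers(antecedent, record):
    return all(record.get(a) == v for a, v in antecedent.items())

def classify(record):
    for antecedent, label in rules:
        if covers(antecedent, record):
            return label
    return DEFAULT

def coverage_and_accuracy(antecedent, label, dataset):
    """coverage(R) = n_covers/|D|, accuracy(R) = n_correct/n_covers."""
    covered = [r for r in dataset if covers(antecedent, r)]
    correct = [r for r in covered if r["class"] == label]
    return (len(covered) / len(dataset),
            len(correct) / len(covered) if covered else 0.0)

x = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify(x))  # -> "yes"
```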
Using IF-THEN Rules for Classification
If no rule is satisfied by X:
• A default rule can be set up to specify a default class, based on the training set
• This may be the majority class, or the majority class of the instances that were not covered by any rule
• The default rule is evaluated at the end, if and only if no other rule covers X; its condition is empty, so it fires exactly when no other rule is satisfied

Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf: each attribute-value pair along the path forms a conjunct, the conjuncts are logically ANDed to form the rule antecedent, and the leaf holds the class prediction
• Rules extracted this way are:
  – Mutually exclusive: no two rules are satisfied by the same instance
  – Exhaustive: there is one rule for each possible attribute-value combination
• Example: rule extraction from our buys_computer decision tree (root age with branches young/middle_aged/senior, then tests on student and credit rating):
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = middle_aged THEN buys_computer = yes
  IF age = senior AND credit_rating = excellent THEN buys_computer = no
  IF age = senior AND credit_rating = fair THEN buys_computer = yes

Rule Induction: Sequential Covering Method
• Sequential covering algorithms extract rules directly from the training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially; each rule for a given class C_i should cover many tuples of C_i but none (or few) of the tuples of other classes
• Steps:
  – Rules are learned one at a time
  – Each time a rule is learned, the tuples covered by the rule are removed
  – The process repeats on the remaining tuples until a termination condition is met, e.g., there are no more training examples, or the quality of a returned rule is below a user-specified threshold

Sequential Covering Algorithm
When learning a rule for a class C, we would like the rule to cover all or most of the training tuples of class C and none or few of the tuples from other classes:

  while (enough target tuples left)
      generate a rule
      remove positive target tuples satisfying this rule

Each successive rule covers a further region of the remaining positive examples (Rule 1, Rule 2, Rule 3, ...).

How to Learn One Rule?
An initial prototype rule is generated and refined until a certain evaluation criterion is met. There are two approaches; the first is specialization:
• Start with the most general rule possible: the empty rule "⇒ class y"
• Add the best attribute-value pair from the candidate list A to the antecedent, e.g.,
  IF income = high THEN loan_decision = accept
  IF income = high AND credit_rating = excellent THEN loan_decision = accept
• Continue until the rule performance measure cannot improve further
• This is a greedy algorithm: it always adds the attribute-value pair that is best at the moment (see the sketch below)
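Here is an illustrative Python sketch of sequential covering with greedy specialization; it is one possible reading of the pseudocode above, using plain rule accuracy as the quality measure rather than FOIL gain:

```python
# Sequential covering: learn rules one at a time, removing the positive
# tuples each rule covers. Records are dicts with a "class" key.

def covers(rule, record):
    return all(record[a] == v for a, v in rule)

def accuracy(rule, data, target):
    covered = [r for r in data if covers(rule, r)]
    if not covered:
        return 0.0
    return sum(r["class"] == target for r in covered) / len(covered)

def learn_one_rule(data, target, attrs):
    rule = []                                   # empty rule => class target
    while accuracy(rule, data, target) < 1.0:
        candidates = [(a, r[a]) for r in data for a in attrs
                      if (a, r[a]) not in rule]
        if not candidates:
            break
        best = max(candidates, key=lambda c: accuracy(rule + [c], data, target))
        if accuracy(rule + [best], data, target) <= accuracy(rule, data, target):
            break                               # no improvement: stop refining
        rule.append(best)                       # greedy specialization step
    return rule

def sequential_covering(data, target, attrs):
    rules, remaining = [], list(data)
    while any(r["class"] == target for r in remaining):
        rule = learn_one_rule(remaining, target, attrs)
        covered = [r for r in remaining
                   if covers(rule, r) and r["class"] == target]
        if not rule or not covered:
            break                               # no useful rule can be learned
        rules.append((rule, target))
        remaining = [r for r in remaining if r not in covered]
    return rules
```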
The second approach is generalization:
• Start with a randomly selected positive tuple and convert it to a rule that covers it; e.g., the tuple (overcast, high, false, P) can be converted to the rule
  IF outlook = overcast AND humidity = high AND windy = false THEN class = P
• Choose one attribute-value pair and remove it, so that the rule covers more positive examples
• Repeat the process until the rule starts to cover negative examples

Rule-Quality Measures
• Used to decide whether appending a test to the rule's condition will result in an improved rule: accuracy, coverage
• Consider R1, which correctly classifies 38 of 40 tuples, vs. R2, which covers 2 tuples and correctly classifies both: which rule is better? Accuracy alone prefers R2, which is why measures that also account for coverage are needed.
• Different measures: FOIL gain, the likelihood ratio statistic, the chi-square statistic
• FOIL gain (in FOIL and RIPPER) checks whether ANDing a new condition to rule R, giving R', results in a better rule; it considers both coverage and accuracy:

  FOIL\_Gain = pos' \times \left(\log_2\frac{pos'}{pos' + neg'} - \log_2\frac{pos}{pos + neg}\right)

  where pos and neg are the numbers of positively and negatively covered tuples of R, and pos' and neg' are those of R'. It favors rules that have high accuracy and cover many positive tuples.
• There is no separate test set for evaluating rules, but rule pruning can be performed by removing a condition from the rule:

  FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}

  where pos/neg are the numbers of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R.

Advantages of Rule-Based Classifiers
• As highly expressive as decision trees
• Easy to interpret
• Easy to generate
• Can classify new instances rapidly
• Performance comparable to decision trees

Nearest Neighbour Approach
• General idea
  – The model: a set of training examples stored in memory
  – Lazy learning: the decision is delayed until classification time; in other words, there is no training!
  – To classify an unseen record: compute its proximity to all training examples and locate the 1 or k nearest neighbours; the nearest neighbours determine the class of the record (e.g., by majority vote)
  – Rationale: "If it walks like a duck, quacks like a duck, and looks like a duck, it probably is a duck."

kNN Classification Algorithm

  algorithm kNN (Tr: training set; k: integer; r: data record): Class
  begin
      for each training example t in Tr do
          calculate proximity d(t, r) on the descriptive attributes
      end for;
      select the top k nearest neighbours into set D;
      Class := majority class in D;
      return Class
  end;
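The kNN procedure above translates almost directly into Python. This is an illustrative sketch (not lecture code) assuming numeric attributes and Euclidean distance as the proximity measure:

```python
# Plain kNN: all work happens at classification time (lazy learning).

from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn(training, k, record):
    """training: list of (attribute_tuple, class); record: attribute_tuple."""
    # proximity of the record to every stored example
    neighbours = sorted(training, key=lambda t: dist(t[0], record))[:k]
    # majority vote among the k nearest neighbours
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

training = [((1.0, 2.0), "P"), ((1.5, 1.8), "P"),
            ((5.0, 8.0), "N"), ((6.0, 9.0), "N")]
print(knn(training, 3, (1.2, 1.9)))  # -> "P"
```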
PEBLS Algorithm (Parallel Exemplar-Based Learning System)
• A nearest neighbour algorithm (k = 1) using a class-based similarity measure
• Examples in memory (exemplars) have weights; training is simple: assigning and refining the weights
• Uses a different proximity measure (the value difference metric)
• Algorithm outline:
  1. Build value difference tables for the descriptive attributes (in preparation for measuring distances between examples)
  2. For each training example, refine the weight of its nearest neighbour
  3. Refine the weights of some training examples when classifying validation examples

PEBLS: Value Difference Table
• For each descriptive attribute, a table of distances d(A_i, A_j) between every pair of its values is built:

  d(V_1, V_2) = \sum_{i=1}^{k} \left| \frac{C_{iV_1}}{C_{V_1}} - \frac{C_{iV_2}}{C_{V_2}} \right|^r

  where r is set to 1, C_{V1} and C_{V2} are the total numbers of examples with value V_1 and V_2, and C_{iV1} and C_{iV2} are the numbers of examples with value V_1 and V_2 that belong to class i.

• Weather training data:

| ROW# | Outlook  | Temperature | Humidity | Windy | Class |
|------|----------|-------------|----------|-------|-------|
| 1    | sunny    | hot         | high     | FALSE | N     |
| 2    | sunny    | hot         | high     | TRUE  | N     |
| 3    | overcast | hot         | high     | FALSE | P     |
| 4    | rain     | mild        | high     | FALSE | P     |
| 5    | rain     | cool        | normal   | FALSE | P     |
| 6    | rain     | cool        | normal   | TRUE  | N     |
| 7    | overcast | cool        | normal   | TRUE  | P     |
| 8    | sunny    | mild        | high     | FALSE | N     |
| 9    | sunny    | cool        | normal   | FALSE | P     |
| 10   | rain     | mild        | normal   | FALSE | P     |
| 11   | sunny    | mild        | normal   | TRUE  | P     |
| 12   | overcast | mild        | high     | TRUE  | P     |
| 13   | overcast | hot         | normal   | FALSE | P     |
| 14   | rain     | mild        | high     | TRUE  | N     |

• Example: sunny occurs 5 times (2 P, 3 N) and overcast 4 times (4 P, 0 N), so
  d(sunny, overcast) = |2/5 - 4/4| + |3/5 - 0/4| = 1.2
• Value difference table for Outlook: d(sunny, overcast) = 1.2, d(sunny, rain) = 0.4, d(overcast, rain) = 0.8 (the table is symmetric with a zero diagonal)

PEBLS: Distance Function
• The distance between two examples X and Y is

  \Delta(X, Y) = w_X \, w_Y \sum_{i=1}^{m} d(x_i, y_i)^2

  where w_X and w_Y are the weights of X and Y, m is the number of attributes, and x_i, y_i are the values of the i-th attribute of X and Y.
• The weight of an exemplar X is

  w_X = \frac{T}{C}

  where T is the total number of times X has been selected as the nearest neighbour and C is the total number of times X classified examples correctly.

PEBLS: Example
• Value difference tables after training:
  – Outlook: d(sunny, overcast) = 1.2, d(sunny, rain) = 0.4, d(overcast, rain) = 0.8
  – Temperature: d(hot, mild) = 0.33, d(hot, cool) = 0.5, d(mild, cool) = 0.33
  – Humidity: d(high, normal) = 0.857
  – Windy: d(TRUE, FALSE) = 0.418
• Assuming row1.weight = row2.weight = 1:

  Δ(row1, row2) = d(sunny, sunny)² + d(hot, hot)² + d(high, high)² + d(FALSE, TRUE)²
                = 0 + 0 + 0 + (1/2)² = 1/4

  (using the raw value difference d(FALSE, TRUE) = 1/2 before weight training)
• Training procedure:
  – Each row is added to the memory space with weight 1
  – Every time a new example is added, its distance to each existing example is calculated and the nearest neighbour is located; the neighbour's class value is compared with the new example's class
  – If they are the same, the neighbour's weight is tuned down by increasing both its total number of uses and its total number of correct uses by 1
  – If they differ, only the total number of uses is increased
  – Validation examples are then used to modify the weights further
• Exemplar weights after training: rows 1, 6, 8, and 12 have weight 2; rows 4 and 5 have weight 1.5; all other rows keep weight 1
• Classifying an unseen record, e.g., (overcast, hot, high, FALSE): the distance Δ between the unseen record and every exemplar in memory is calculated, the nearest neighbour is located, and the class of the nearest neighbour is taken as the class of the record
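Before moving on, the value difference metric can be checked with a short illustrative Python sketch (not lecture code), reproducing the Outlook table above:

```python
# Building a PEBLS value difference table for Outlook from the weather data.

from collections import Counter

# (outlook, class) pairs from the weather table
outlook = [("sunny", "N"), ("sunny", "N"), ("overcast", "P"), ("rain", "P"),
           ("rain", "P"), ("rain", "N"), ("overcast", "P"), ("sunny", "N"),
           ("sunny", "P"), ("rain", "P"), ("sunny", "P"), ("overcast", "P"),
           ("overcast", "P"), ("rain", "N")]

classes = {c for _, c in outlook}
total = Counter(v for v, _ in outlook)        # C_V: examples per value
per_class = Counter(outlook)                  # C_iV: examples per (value, class)

def d(v1, v2, r=1):
    """Value difference metric with exponent r (r = 1 on the slides)."""
    return sum(abs(per_class[(v1, c)] / total[v1]
                   - per_class[(v2, c)] / total[v2]) ** r
               for c in classes)

print(round(d("sunny", "overcast"), 1))  # 1.2
print(round(d("sunny", "rain"), 1))      # 0.4
print(round(d("overcast", "rain"), 1))   # 0.8
```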
Artificial Neural Network Approach
• Biological motivation:
  – Our brains are made up of about 100 billion tiny units called neurons
  – Each neuron is connected to thousands of other neurons and communicates with them via electrochemical signals
  – Signals come into the neuron via junctions called synapses, which are located at the ends of branches of the cell called dendrites
  – The neuron continuously receives signals from these inputs and sums them up in some way; if the end result is greater than some threshold value, the neuron fires
  – It generates a voltage and outputs a signal along something called an axon
• General idea
  – The model: a network of connected artificial neurons
  – Training: select a specific network topology and use the training examples to tune the weights attached to the links connecting the neurons
  – To classify an unseen record X, feed the descriptive attribute values of the record into the network as inputs; the network computes an output value that can be converted to a class label

Artificial Neuron (Perceptron)
• Inputs i_1, i_2, i_3 with weights w_1, w_2, w_3 feed a summing unit followed by a transformation function:

  Sum function: x = w_1 i_1 + w_2 i_2 + w_3 i_3
  Transformation function: \text{sigmoid}(x) = \frac{1}{1 + e^{-x}}

Network Topology
• A neural network can have many hidden layers, but one layer is normally considered sufficient
• The more units a hidden layer has, the greater the network's capacity for pattern recognition
• The output layer can have more than one unit, predicting the likelihood of a number of classes
• Constant inputs can be fed into the units in the hidden and output layers
• A network with links from lower layers to upper layers is a feed-forward network; a network with links between nodes of the same layer is a recurrent network

General Principle for Training an ANN

  algorithm trainNetwork (Tr: training set): Network
  begin
      R := initial network with a particular topology;
      initialise the weight vector with random values w(0);
      repeat
          for each training example t = <x_i, y_i> in Tr do
              compute the predicted class output ŷ(k);
              for each weight w_j in the weight vector do
                  update the weight: w_j(k+1) := w_j(k) + λ (y_i − ŷ(k)) x_ij
              end for;
          end for;
      until stopping criterion is met;
      return R
  end;

λ is the learning factor: the larger its value, the greater the amount by which the weights change. (A runnable sketch of this update rule follows.)
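The following illustrative Python sketch (not lecture code) applies the weight-update rule above to a single perceptron with a threshold transformation, learning the AND function; the bias is handled as a constant input, as described in the topology notes:

```python
# A single perceptron trained with w_j <- w_j + lam * (y - y_hat) * x_j.

import random

def predict(weights, x):
    # sum function followed by a simple threshold transformation
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else 0

def train(examples, lam=0.1, epochs=50):
    random.seed(0)                               # reproducible initial weights
    n = len(examples[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n)]
    for _ in range(epochs):                      # "repeat until stopping criterion"
        for x, y in examples:
            y_hat = predict(weights, x)
            weights = [w + lam * (y - y_hat) * xi  # the slide's update rule
                       for w, xi in zip(weights, x)]
    return weights

# AND function; the leading 1.0 is the constant (bias) input
examples = [((1.0, 0, 0), 0), ((1.0, 0, 1), 0), ((1.0, 1, 0), 0), ((1.0, 1, 1), 1)]
w = train(examples)
print([predict(w, x) for x, _ in examples])  # expected: [0, 0, 0, 1]
```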
Using ANN for Classification
• With multiple hidden layers we do not know the actual class value at the hidden units, so it is difficult to adjust their weights; the solution is back-propagation, which propagates the error backwards, layer by layer, from the output layer
• Number of nodes in the input layer: determined by the number and data types of the attributes
  – Continuous and binary attributes: 1 node per attribute
  – Categorical attributes: convert to numeric or binary; an attribute with k labels needs at least log k nodes
• Number of nodes in the output layer: determined by the number of classes
  – For a 2-class problem, 1 node; for a k-class problem, at least log k nodes
• Number of hidden layers and of nodes in the hidden layers: difficult to decide
• Model overfitting: use validation examples to further tune the weights in the network
• Convert input values into numbers before classification, and convert the numeric output into a class label after classification; descriptive attributes should be normalized or converted to binary
• Training examples are used repeatedly, so the training cost is very high
• There is difficulty in explaining classification decisions

Model Evaluation and Selection
• Evaluation metrics: how can we measure accuracy? What other metrics should be considered?
• Use a validation/test set of class-labeled tuples instead of the training set when assessing accuracy
• Methods for estimating a classifier's accuracy:
  – Holdout method, random subsampling
  – Cross-validation
  – Bootstrap
• Comparing classifiers:
  – Confidence intervals
  – Cost-benefit analysis and ROC curves

Classifier Evaluation Metrics: Confusion Matrix
The confusion matrix shows how well the classifier can recognize tuples of different classes:

| Actual \ Predicted | yes                  | no                   |
|--------------------|----------------------|----------------------|
| yes                | True Positives (TP)  | False Negatives (FN) |
| no                 | False Positives (FP) | True Negatives (TN)  |

• TP: positive tuples correctly labeled by the classifier
• TN: negative tuples correctly labeled by the classifier
• FP: negative tuples incorrectly labeled as positive
• FN: positive tuples mislabeled as negative
• Extra rows/columns may be added to provide totals

Example of a confusion matrix:

| Actual \ Predicted | buy_computer = yes | buy_computer = no | Total |
|--------------------|--------------------|-------------------|-------|
| buy_computer = yes | 6954               | 46                | 7000  |
| buy_computer = no  | 412                | 2588              | 3000  |
| Total              | 7366               | 2634              | 10000 |

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
• Classifier accuracy (recognition rate): the percentage of test set tuples that are correctly classified:

  Accuracy = (TP + TN) / All

• Error rate (misclassification rate) = 1 − accuracy:

  Error rate = (FP + FN) / All

• Class imbalance problem: one class may be rare, e.g., fraud or HIV-positive, so there is a significant majority of the negative class and a minority of the positive class. With P = TP + FN and N = FP + TN:
  – Sensitivity (true positive recognition rate) = TP / P
  – Specificity (true negative recognition rate) = TN / N
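As an illustrative Python sketch (not lecture code), the metrics above can be computed directly from the buy_computer confusion matrix:

```python
# Evaluation metrics from the buy_computer confusion matrix above.

TP, FN = 6954, 46     # actual yes
FP, TN = 412, 2588    # actual no
P, N = TP + FN, FP + TN
ALL = P + N

accuracy = (TP + TN) / ALL          # 0.9542
error_rate = (FP + FN) / ALL        # 0.0458 = 1 - accuracy
sensitivity = TP / P                # 0.9934  (true positive rate)
specificity = TN / N                # 0.8627  (true negative rate)
precision = TP / (TP + FP)          # 0.9441  (defined in the next section)

print(f"accuracy={accuracy:.4f} error={error_rate:.4f} "
      f"sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```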
Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision (exactness): what percentage of the tuples that the classifier labeled as positive are actually positive?

  Precision = TP / (TP + FP)

• Recall (completeness): what percentage of the positive tuples did the classifier label as positive?

  Recall = TP / (TP + FN)

• A perfect score is 1.0; there is an inverse relationship between precision and recall
• F measure (F_1 or F-score): the harmonic mean of precision and recall:

  F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}

• F_β: a weighted measure of precision and recall that assigns β times as much weight to recall as to precision:

  F_\beta = \frac{(1+\beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall}

Classifier Evaluation Metrics: Example

| Actual \ Predicted | cancer = yes | cancer = no | Total | Recognition (%)     |
|--------------------|--------------|-------------|-------|---------------------|
| cancer = yes       | 90           | 210         | 300   | 30.00 (sensitivity) |
| cancer = no        | 140          | 9560        | 9700  | 98.56 (specificity) |
| Total              | 230          | 9770        | 10000 | 96.50 (accuracy)    |

• Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00%

Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
• Holdout method
  – The given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation
  – Random subsampling: a variation of holdout; repeat holdout k times and take the average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular); see the sketch below
  – Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  – At the i-th iteration, use D_i as the test set and the others as the training set
  – Leave-one-out: k folds where k = the number of tuples, so one sample is left out for testing at a time

Evaluating Classifier Accuracy: Bootstrap
• Bootstrap: samples the given training tuples uniformly with replacement
  – i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
  – Examples used for the training set can end up being used for the test set too

HW
• Research and learn about:
  – Ensemble method principles
  – Bagging
  – Boosting
  – ROC curves

Ensemble Methods: Increasing the Accuracy
• Ensemble methods use a combination of models to increase accuracy: combine a series of k learned models, M_1, M_2, ..., M_k, with the aim of creating an improved model M*
• Popular ensemble methods: bagging, boosting, ensembles of heterogeneous classifiers
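Returning to the cross-validation method above, here is an illustrative Python sketch (not lecture code) of plain k-fold accuracy estimation for any classifier expressed as a fit function:

```python
# k-fold cross-validation: k mutually exclusive folds, each used once
# as the test set while the rest form the training set.

import random
from collections import Counter

def k_fold_accuracy(data, k, fit):
    """data: list of (attributes, label); fit(train) -> model(x) -> label."""
    data = data[:]                            # copy before shuffling
    random.seed(0)
    random.shuffle(data)                      # random partitioning
    folds = [data[i::k] for i in range(k)]    # k mutually exclusive subsets
    accs = []
    for i in range(k):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit(train)
        accs.append(sum(model(x) == y for x, y in test) / len(test))
    return sum(accs) / k                      # average over the k iterations

# usage with a trivial majority-class "classifier" (illustrative only):
def fit_majority(train):
    major = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: major

data = [((i,), "P" if i % 3 else "N") for i in range(30)]
print(k_fold_accuracy(data, 10, fit_majority))
```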
Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spills, faults
• Traditional methods assume a balanced distribution of classes and equal error costs, so they are not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification:
  – Oversampling: re-sampling of data from the positive class
  – Undersampling: randomly eliminating tuples from the negative class

Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models; originated in signal detection theory
• The curve shows the trade-off between the true positive rate (vertical axis) and the false positive rate (horizontal axis)
• The area under the ROC curve is a measure of the accuracy of the model: a model with perfect accuracy has an area of 1.0
• The plot also shows a diagonal line: along it, for every TP we are equally likely to encounter an FP
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model

Issues Affecting Model Selection
• Accuracy: classifier accuracy in predicting the class label
• Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability: understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Comparison of Techniques
• Criteria:
  – Model interpretability: ease of understanding classification decisions
  – Model maintainability: ease of modifying the model in the presence of new training examples
  – Training cost: computational cost of building a model
  – Classification cost: computational cost of classifying an unseen record
• Comparison of approaches:

| Approach                  | Interpretability | Maintainability | Training Cost | Classification Cost |
|---------------------------|------------------|-----------------|---------------|---------------------|
| Decision Tree             | High             | Low             | High          | Low                 |
| Nearest Neighbours        | Medium           | High            | Low           | High                |
| Rule-based                | High             | Low             | High          | Medium              |
| Artificial Neural Network | Low              | Low             | High          | Low                 |
| Bayesian Classifier       | Medium           | Medium          | Medium        | Low                 |

Decision Tree Induction in Weka
• Overview of the available tree learners:
  – ID3 (only works for categorical attributes)
  – J48 (Java implementation of C4.5)
  – RandomTree (with K randomly chosen attributes)
  – RandomForest (a forest of random trees)
  – REPTree (regression tree with reduced-error pruning)
  – BFTree (best-first tree, using Gain or Gini)
  – FT (functional tree, logistic regression at split nodes)
  – SimpleCart (CART with cost-complexity pruning)
• Preparation: pre-process attributes if necessary, select attributes, and specify the class attribute
• Constructing a classification model (e.g., ID3):
  1. Choose a method and set its parameters
  2. Set a test option
  3. Start the process
  4. View the model and the evaluation results
  5. Select the option to view the tree
• J48 (unpruned tree) and RandomTree models can be constructed and viewed in the same way

Decision Tree Induction in Weka: Classifying Unseen Records
1. Prepare the unseen records in an ARFF file, with the class values left as unknown ("?")
2. Classify the unseen records in the file: select the supplied-test-set option, click the Set... button and load the file, then press the button to start the classification
3. Save the classification results into a file: select the option to pop up the visualisation, set both X and Y to instance_number, and save the results into a file
4. The classification results appear in an ARFF file with the class labels assigned

Comparison of Techniques: Performance Comparison in Weka
• A system module known as the Experimenter is designed for comparing the performance of classification techniques over a single data set or a collection of data sets
• The data miner sets up an experiment with:
  – Selected data set(s)
  – Selected algorithm(s) and the number of repeated runs for each algorithm
  – A selected test option (e.g., cross-validation)
  – A selected p value (indicating the confidence level)
• Setting up an experiment: choose a test option, name the file to store the experiment results, add data sets, and add algorithms (a new experiment can be created or an existing one loaded)
• Experiment results in Weka: load the experiment data, choose the analysis method, and perform the analysis
  – Output: accuracy rates of the algorithms
  – Pairwise comparisons of algorithms, with significantly better and worse accuracies marked

Classification in Practice
• Process of a classification project:
  1. Locate data
  2. Prepare data
  3. Choose a classification method
  4. Construct the model and tune the model
  5. Measure its accuracy and go back to step 3 or 4 until the accuracy is satisfactory
  6. Further evaluate the model from other aspects such as complexity, comprehensibility, etc.
  7. Deliver the model and test it in a real environment; further modify the model if necessary
• Data preparation:
  – Identify descriptive features (input attributes)
  – Identify or define the class
  – Determine the sizes of the training, validation, and test sets
  – Select examples: spread and coverage of classes, spread and coverage of attribute values, null values, noisy data
  – Prepare the input values (categorical to continuous, continuous to categorical)
References
• C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
• C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
• L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
• C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-168, 1998.
• P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95.
• H. Cheng, X. Yan, J. Han, and C.-W. Hsu. Discriminative frequent pattern analysis for effective classification. ICDE'07.
• H. Cheng, X. Yan, J. Han, and P. S. Yu. Direct discriminative pattern mining for effective classification. ICDE'08.
• W. Cohen. Fast effective rule induction. ICML'95.
• G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05.
• T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.
• J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.
• M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. EDBT'96.
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
• S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4):345-389, 1998.
• J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
• J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93.
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
• J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.
• R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. VLDB'98.
• J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. VLDB'96.
• J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.
• P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
• S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, 2005.
• X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03.
• H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.