Lecture 6: Classification and Prediction (Part A)
Zhou Shuigeng
May 6, 2007

Outline
- What is Classification/Prediction?
- Issues Regarding Classification/Prediction
- Feature Selection
- Classification by Decision Tree Induction
- Bayesian Classification
- Classification Evaluation

Predictive Modeling
- Goal: learn a mapping y = f(x; θ)
- Need: a model structure, a score function, and an optimization strategy
- Categorical y ∈ {c1, …, cm}: classification
- Real-valued y: regression
- Note: the classes {c1, …, cm} are usually assumed to be mutually exclusive and exhaustive

Classification vs. Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - constructs a model from the training set and the class labels of a classifying attribute, then uses the model to classify new data
- Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis

Two Classes of Classifiers
- Discriminative classification
  - provides a decision surface or boundary that separates the different classes
  - in real life it is often impossible to separate the classes perfectly
  - instead, seek a function f(x; θ) that maximizes some measure of separation between the classes; f(x; θ) is called the discriminant function
- Probabilistic classification
  - focuses on modeling the probability distribution of each class

Discriminative vs. Probabilistic Classification
[Figure: candidate decision boundaries D1, D2, D3]

Classification - A Two-Step Process
- Model construction: describing a set of predetermined classes
  - each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - the set of tuples used for model construction is the training set
  - the model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction, and the accuracy rate is the percentage of test-set samples correctly classified by the model
  - the test set must be independent of the training set; otherwise the accuracy estimate will be overly optimistic (overfitting)
  - if the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction
- The training data is fed to a classification algorithm, which produces a classifier (model).

  NAME   RANK            YEARS  TENURED
  Mike   Assistant Prof  3      no
  Mary   Assistant Prof  7      yes
  Bill   Professor       2      yes
  Jim    Associate Prof  7      yes
  Dave   Assistant Prof  6      no
  Anne   Associate Prof  3      no

- Learned classifier (model): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
- The classifier is applied to the testing data to estimate accuracy, and then to unseen data, e.g., (Jeff, Professor, 4) → tenured?

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

- (A small end-to-end sketch of these two steps on this toy data follows below.)
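To make the two-step process concrete, here is a minimal Python sketch using the toy tenure data and the rule induced on the model-construction slide. The tuples and the rule come from the slides; the function and variable names are only illustrative.

# Step 1: "construct" a model; here it is simply the rule learned on the slide.
def tenured_model(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2: estimate accuracy on the independent test set, then classify unseen data.
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]
correct = sum(tenured_model(rank, years) == label
              for _, rank, years, label in test_set)
print("accuracy on test set:", correct / len(test_set))                      # 0.75 here
print("unseen sample (Jeff, Professor, 4):", tenured_model("Professor", 4))  # 'yes'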
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  - new data is classified based on the training set
- Unsupervised learning (clustering)
  - the class labels of the training data are unknown
  - given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning: preprocess the data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize the data

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability: time to construct the model, time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency for disk-resident databases
- Interpretability: understanding and insight provided by the model
- Goodness of rules: decision tree size, compactness of classification rules

Feature Selection - Definition
- Attempt to select a minimal-size subset of features according to certain criteria:
  - classification accuracy does not drop significantly
  - the class distribution given only the selected features is as close as possible to the original class distribution given all features

Feature Selection: Dimensions
- Evaluation measure: accuracy, consistency, classic
- Search strategy: complete, heuristic, nondeterministic
- Generation scheme: forward, backward, random

Four Basic Steps
- (1) Generation: produce a candidate subset from the original feature set
- (2) Subset evaluation: measure the goodness of the subset
- (3) Stopping criterion: if not satisfied, return to generation; if satisfied, continue
- (4) Learning/testing with the selected subset

Search Procedure - Approaches
- Complete
  - complete search for the optimal subset
  - guaranteed optimality
  - able to backtrack
- Heuristic
  - "hill-climbing": at each iteration, the remaining features not yet selected/rejected are considered for selection/rejection
  - simple to implement and fast in producing results, but may produce sub-optimal results
- Random
  - a maximum number of iterations can be set
  - optimality depends on the values of some parameters

Evaluation Functions - Method Types
- Filter methods: distance measures, information measures, dependence measures, consistency measures
- Wrapper methods: classifier error rate measures

Some Measures for Text Classification
- Information gain:
  IG(t) = Σ_{c∈C} [ P(c,t)·log( P(c,t) / (P(c)P(t)) ) + P(c,¬t)·log( P(c,¬t) / (P(c)P(¬t)) ) ]
- Mutual information:
  MI(t,c) = log( Pr(t|c) / Pr(t) )
- Chi-square:
  χ²(c,f) = ( P(c,f)P(¬c,¬f) − P(¬c,f)P(c,¬f) )² / ( P(c)P(f)P(¬c)P(¬f) )
- (A short numerical sketch of these measures follows below.)
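As a check on these formulas, here is a small Python sketch that evaluates all three measures for one term and a two-class problem from a 2x2 term-class contingency table. The counts and variable names are made up purely for illustration.

import math

# Hypothetical document counts for one term t and two classes.
#                t present   t absent
# class = pos        40          10
# class = neg        20          30
N = 100.0
counts = {("pos", True): 40, ("pos", False): 10,
          ("neg", True): 20, ("neg", False): 30}

P  = {k: v / N for k, v in counts.items()}                       # joint P(c, t)
Pc = {c: P[(c, True)] + P[(c, False)] for c in ("pos", "neg")}   # P(c)
Pt = {b: P[("pos", b)] + P[("neg", b)] for b in (True, False)}   # P(t), P(not t)

# Information gain: sum over classes and over term present/absent.
IG = sum(P[(c, b)] * math.log(P[(c, b)] / (Pc[c] * Pt[b]))
         for c in ("pos", "neg") for b in (True, False))

# Mutual information for (t, c = pos): log( P(t|c) / P(t) ).
p_t_given_pos = counts[("pos", True)] / (counts[("pos", True)] + counts[("pos", False)])
MI = math.log(p_t_given_pos / Pt[True])

# Chi-square for (c = pos, f = t), following the slide's formula.
chi2 = ((P[("pos", True)] * P[("neg", False)] - P[("neg", True)] * P[("pos", False)]) ** 2
        / (Pc["pos"] * Pt[True] * Pc["neg"] * Pt[False]))

print(IG, MI, chi2)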
Problem with the Filter Model
- It assesses the merits of features from the data alone and ignores the classification algorithm
- The generated feature subset may therefore not be optimal for the target classification algorithm

Wrapper Model
- Uses the actual target classification algorithm to evaluate the accuracy of each candidate subset
- Evaluation criterion: classifier error rate
- The generation method can be heuristic, complete, or random

Wrapper Compared to Filter
- (+) higher accuracy
- (-) higher computation cost
- (-) lacks generality

Training Dataset
- This follows an example from Quinlan's ID3

  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31…40   high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31…40   low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31…40   medium  no       excellent      yes
  31…40   high    yes      fair           yes
  >40     medium  no       excellent      no

Output: A Decision Tree for "buys_computer"
- age?
  - <=30 → student?
    - no → no
    - yes → yes
  - 31…40 → yes
  - >40 → credit_rating?
    - excellent → no
    - fair → yes

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - the tree is constructed in a top-down, recursive, divide-and-conquer manner
  - at the start, all the training examples are at the root
  - attributes are categorical (continuous-valued attributes are discretized in advance)
  - examples are partitioned recursively based on selected attributes
  - test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - all samples for a given node belong to the same class
  - there are no remaining attributes for further partitioning (majority voting is used to classify the leaf)
  - there are no samples left

Decision Tree Induction Algorithms
- Many algorithms: Hunt's algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let S contain s_i tuples of class C_i, for i = 1, …, m
- Information (entropy) required to classify an arbitrary tuple:
  I(s_1, s_2, …, s_m) = − Σ_{i=1}^{m} (s_i / s) log2(s_i / s)
- Entropy of attribute A with values {a_1, a_2, …, a_v}:
  E(A) = Σ_{j=1}^{v} ((s_{1j} + … + s_{mj}) / s) · I(s_{1j}, …, s_{mj})
- Information gained by branching on attribute A:
  Gain(A) = I(s_1, s_2, …, s_m) − E(A)
- (A small Python sketch of these quantities follows below.)
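The entropy and gain formulas above translate directly into code. The following sketch (function names are illustrative) recomputes them from class counts and reproduces the numbers derived on the next slide for the buys_computer data.

import math

def I(*counts):
    """Expected information I(s1, ..., sm) = -sum (si/s) log2(si/s)."""
    s = sum(counts)
    return -sum(si / s * math.log2(si / s) for si in counts if si > 0)

def gain(class_counts, partition):
    """Gain(A) = I(s1, ..., sm) - E(A); `partition` lists per-branch class counts."""
    s = sum(class_counts)
    E = sum(sum(branch) / s * I(*branch) for branch in partition)
    return I(*class_counts) - E

# buys_computer data: 9 'yes', 5 'no'; branches of attribute age: <=30, 31..40, >40.
print(I(9, 5))                                   # 0.940
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))    # Gain(age)     = 0.246
print(gain([9, 5], [[6, 1], [3, 4]]))            # Gain(student) = 0.151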
Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes" (9 samples); Class N: buys_computer = "no" (5 samples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:

  age      p_i  n_i  I(p_i, n_i)
  <=30     2    3    0.971
  31…40    4    0    0
  >40      3    2    0.971

  E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694
  Here (5/14)·I(2,3) means that "age <=30" covers 5 of the 14 samples, with 2 yes's and 3 no's.
  Hence Gain(age) = I(p, n) − E(age) = 0.246
- Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

Other Attribute Selection Measures
- Gini index (CART, IBM IntelligentMiner)
  - all attributes are assumed continuous-valued
  - assume there exist several possible split values for each attribute
  - may need other tools, such as clustering, to get the possible split values
  - can be modified for categorical attributes

Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 − Σ_{j=1}^{n} p_j²
  where p_j is the relative frequency of class j in T.
- If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (all possible splitting points must be enumerated for each attribute)
- (A small numerical sketch follows below.)
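A minimal sketch of the gini computations, assuming a binary split of the buys_computer data on age at 30; the split point and the function names are chosen only for illustration.

def gini(counts):
    """gini(T) = 1 - sum_j p_j^2, given per-class counts in T."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(left, right):
    """Weighted gini of a binary split: (N1/N) gini(T1) + (N2/N) gini(T2)."""
    n1, n2 = sum(left), sum(right)
    n = n1 + n2
    return n1 / n * gini(left) + n2 / n * gini(right)

# buys_computer: split on age <= 30 vs. age > 30
# age <= 30: 2 yes, 3 no;  age > 30: 7 yes, 2 no
print(gini([9, 5]))                    # gini before the split: about 0.459
print(gini_split([2, 3], [7, 2]))      # gini of this split: about 0.394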
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction; the leaf node holds the class prediction
- Rules are easier for humans to understand
- Example (from the tree above):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31…40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data
  - too many branches, some of which may reflect anomalies due to noise or outliers
  - poor accuracy on unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early by not splitting a node if this would cause the goodness measure to fall below a threshold
    - it is difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
    - use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes
  - dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals (a split-point search sketch appears at the end of this section)
- Handle missing attribute values
  - assign the most common value of the attribute, or
  - assign a probability to each of the possible values
- Attribute construction
  - create new attributes based on existing ones that are sparsely represented
  - this reduces fragmentation, repetition, and replication

Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed compared with other classification methods
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - classification accuracy comparable with other methods

Scalable Decision Tree Induction Methods in Data Mining Studies
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning, stopping tree growth earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)

Data Cube-Based Decision-Tree Induction
- Integration of generalization with decision-tree induction (Kamber et al., 1997)
- Classification at primitive concept levels
  - e.g., precise temperature, humidity, outlook, etc.
  - low-level concepts, scattered classes, bushy classification trees
  - semantic interpretation problems
- Cube-based multi-level classification
  - relevance analysis at multiple levels
  - information-gain analysis with dimension and level

Presentation of Classification Results
[Figure: screenshot of a classification-results presentation]

Visualization of a Decision Tree in SGI/MineSet 3.0
[Figure: decision tree visualization in SGI/MineSet 3.0]

Interactive Visual Mining by Perception-Based Classification (PBC)
[Figure: interactive visual mining with PBC]
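Returning to the continuous-valued-attribute enhancement mentioned a few slides back: one common way to define the discrete intervals is to sort the attribute values and evaluate candidate split points between adjacent values, keeping the binary split with the highest information gain. The sketch below follows that idea; the data and helper names are illustrative, and the entropy helper mirrors the earlier formulas.

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split_point(values, labels):
    """Enumerate midpoints between sorted adjacent values; return the binary split
    with the highest information gain (one way to discretize a continuous attribute)."""
    pairs = sorted(zip(values, labels))
    base, n = entropy(labels), len(labels)
    best_gain, best_split = -1.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        g = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if g > best_gain:
            best_gain, best_split = g, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_gain, best_split

# Hypothetical ages with a yes/no label, just to exercise the search.
ages   = [23, 25, 31, 35, 38, 42, 45, 50]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]
print(best_split_point(ages, labels))     # (information gain, chosen split point)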
Bayesian Classification
- A probabilistic framework for solving classification problems
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing the sample X given that the hypothesis holds

Bayesian Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = P(X|H) P(H) / P(X)
- Informally: posterior = likelihood × prior / evidence
- MAP (maximum a posteriori) hypothesis:
  h_MAP ≡ argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)
- Practical difficulty: requires initial knowledge of many probabilities and incurs significant computational cost

Basic Formulas for Probabilities
- Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
- Sum rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if events A1, …, An are mutually exclusive with Σ_{i=1}^{n} P(Ai) = 1, then
  P(B) = Σ_{i=1}^{n} P(B|Ai) P(Ai)

Naïve Bayes Classifier (I)
- Along with decision trees, neural networks, and nearest neighbors, one of the most practical learning methods
- When to use:
  - a moderate or large training set is available
  - the attributes that describe instances are conditionally independent given the classification
- Successful applications: diagnosis, classifying text documents

Naïve Bayes Classifier (II)
- Assume a target function f: X → V, where each instance x is described by attributes <a1, a2, …, an>
- The most probable value of f(x) is:
  v_MAP = argmax_{vj∈V} P(vj | a1, a2, …, an) = argmax_{vj∈V} P(a1, a2, …, an | vj) P(vj)
- Naïve Bayes assumption:
  P(a1, a2, …, an | vj) = Π_i P(ai | vj)
- which gives the naïve Bayes classifier:
  v_NB = argmax_{vj∈V} P(vj) Π_i P(ai | vj)

Naïve Bayes Algorithm
- Naive_Bayes_Learn(examples):
  - for each target value vj: estimate P̂(vj)
  - for each attribute value ai of each attribute a: estimate P̂(ai | vj)
- Classify_New_Instance(x):
  v_NB = argmax_{vj∈V} P̂(vj) Π_{ai∈x} P̂(ai | vj)

Naïve Bayes: Subtleties (I)
- The conditional independence assumption is often violated, but the classifier works surprisingly well anyway
- Note that the estimated posteriors need not be correct; we need only that
  argmax_{vj} P̂(vj) Π_i P̂(ai | vj) = argmax_{vj} P(vj) P(a1, …, an | vj)
- See Domingos & Pazzani (1996) for analysis
- Naïve Bayes posteriors are often unrealistically close to 1 or 0
- Reference: Pedro Domingos and Michael Pazzani. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (1996)

Naïve Bayes: Subtleties (II)
- What if none of the training instances with target value vj have attribute value ai? Then
  P̂(ai | vj) = 0, and hence P̂(vj) Π_i P̂(ai | vj) = 0
- The typical solution is a Bayesian (m-estimate) for P̂(ai | vj):
  P̂(ai | vj) = (nc + m·p) / (n + m)
  where
  - n is the number of training examples for which v = vj
  - nc is the number of examples for which v = vj and a = ai
  - p is a prior estimate for P̂(ai | vj)
  - m is the weight given to the prior (i.e., the number of "virtual" examples)
- (A small sketch of this estimate in code follows below.)
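A minimal sketch of the naïve Bayes learner with the m-estimate above, for categorical attributes. All names are illustrative, and the prior p defaults to a uniform 1/k over the k observed values of an attribute, which is one common choice rather than something mandated by the slides.

from collections import Counter, defaultdict

def naive_bayes_learn(examples, m=1.0):
    """examples: list of (attribute_tuple, label). Returns a classify(x) function."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # value_counts[(j, label)][value] = nc
    values_of = defaultdict(set)          # observed values of attribute j
    for x, label in examples:
        for j, v in enumerate(x):
            value_counts[(j, label)][v] += 1
            values_of[j].add(v)

    def p_hat(j, v, label):
        # m-estimate: (nc + m*p) / (n + m), with uniform prior p = 1/|values of attribute j|
        n, nc = class_counts[label], value_counts[(j, label)][v]
        p = 1.0 / len(values_of[j])
        return (nc + m * p) / (n + m)

    def classify(x):
        total = sum(class_counts.values())
        def score(label):
            s = class_counts[label] / total          # estimated P(vj)
            for j, v in enumerate(x):
                s *= p_hat(j, v, label)              # product of P(ai | vj)
            return s
        return max(class_counts, key=score)

    return classify

# Hypothetical toy usage: two categorical attributes, two classes.
clf = naive_bayes_learn([(("sunny", "hot"), "no"), (("rain", "mild"), "yes"),
                         (("sunny", "mild"), "yes"), (("rain", "hot"), "no")])
print(clf(("sunny", "mild")))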
Training Dataset (Naïve Bayes Example)
- Training data: the 14-tuple buys_computer data set shown earlier (Quinlan's example)
- Class C1: buys_computer = "yes"; class C2: buys_computer = "no"
- Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naïve Bayesian Classifier: Example
- Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- P(X|Ci):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
- P(X|Ci) · P(Ci):
  P(X | buys_computer = "yes") · P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") · P(buys_computer = "no") = 0.007
- X belongs to class buys_computer = "yes"
- (A short script that reproduces these numbers follows below.)

NBC: Another Example
[Figure: another worked naïve Bayes example]

Naïve Bayesian Classifier: Comments
- Advantages
  - easy to implement
  - good results obtained in most cases
- Disadvantages
  - assumption of class-conditional independence, therefore loss of accuracy
  - in practice, dependencies exist among variables
    - e.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - such dependencies cannot be modeled by the naïve Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks
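The slide's arithmetic can be reproduced directly from the training table. A short verification script (variable names are illustrative):

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")            # sample to classify

for c in ("yes", "no"):
    in_class = [r for r in rows if r[4] == c]
    prior = len(in_class) / len(rows)            # P(Ci)
    likelihood = 1.0                             # P(X|Ci) under conditional independence
    for j, v in enumerate(X):
        likelihood *= sum(1 for r in in_class if r[j] == v) / len(in_class)
    print(c, round(likelihood, 3), round(likelihood * prior, 3))
# prints: yes 0.044 0.028  and  no 0.019 0.007  ->  X is classified as buys_computer = "yes"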
Bayesian Belief Networks
- Interesting because:
  - the naïve Bayes assumption of conditional independence is too restrictive
  - but the problem is intractable without some such assumptions
- Bayesian belief networks describe conditional independence among subsets of variables
  - this allows combining prior knowledge about (in)dependencies among variables with observed training data
- Also called Bayes nets

Conditional Independence
- Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
  (∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
- More compactly: P(X | Y, Z) = P(X | Z)
- Example: Thunder is conditionally independent of Rain given Lightning:
  P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Bayesian Belief Network (I)
- Provides a graphical representation of probabilistic relationships among a set of random variables
- Consists of:
  - a directed acyclic graph (DAG)
    - each node corresponds to a variable
    - each arc corresponds to a dependence relationship between a pair of variables
  - a probability table associating each node with its immediate parents

Bayesian Belief Network (II)
- The network represents a set of conditional independence assertions:
  - a node in a Bayesian network is conditionally independent of all of its nondescendants, given its parents

Probability Table
- If X does not have any parents, the table contains the prior probability P(X)
- If X has only one parent Y, the table contains the conditional probability P(X|Y)
- If X has multiple parents Y1, Y2, …, Yk, the table contains the conditional probability P(X | Y1, Y2, …, Yk)

Example of Bayesian Belief Network
[Figure: example Bayesian belief network with its conditional probability tables]

Inference in Bayesian Networks
- How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
  - the Bayes net contains all the information needed for this inference
  - if only one variable has an unknown value, it is easy to infer it
  - in the general case the problem is NP-hard
- In practice, inference succeeds in many cases
  - exact inference methods work well for some network structures
  - Monte Carlo methods "simulate" the network randomly to calculate approximate solutions

Learning of Bayesian Networks
- Several variants of this mining task
  - the network structure might be known or unknown
  - training examples might provide values of all network variables, or just some
- If the structure is known and all variables are observed, learning is as easy as training a naïve Bayes classifier

Summary: Bayesian Belief Networks
- Combine prior knowledge with observed data
- The impact of prior knowledge (when correct!) is to lower the sample complexity
- Active research area
  - extend from Boolean to real-valued variables
  - parameterized distributions instead of tables
  - extend to first-order instead of propositional systems
  - more effective inference methods
  - …

Classification Accuracy: Estimating Error Rates
- Partition: training-and-testing
  - use two independent data sets, e.g., a training set (2/3) and a test set (1/3)
  - suitable for data sets with a large number of samples
- Cross-validation
  - divide the data set into k subsamples
  - use k−1 subsamples as training data and one subsample as test data (k-fold cross-validation)
  - suitable for data sets of moderate size
- Bootstrapping (leave-one-out)
  - for small data sets

Bagging and Boosting
- General idea: the classification method (CM) is applied to the training data to obtain a classifier C, and to altered versions of the training data to obtain classifiers C1, C2, …; these are then aggregated into a combined classifier C*

Bagging
- Given a set S of s samples
- Generate a bootstrap sample T from S; cases in S may not appear in T or may appear more than once
- Repeat this sampling procedure, obtaining a sequence of k independent training sets
- A corresponding sequence of classifiers C1, C2, …, Ck is constructed for these training sets using the same classification algorithm
- To classify an unknown sample X, let each classifier predict (vote)
- The bagged classifier C* counts the votes and assigns X to the class with the most votes
- (A small bagging sketch follows below.)
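A minimal sketch of bagging. The base learner interface (a `train` function that maps samples to a prediction function) and all names are placeholders chosen for illustration, not something prescribed by the slides.

import random
from collections import Counter

def bagging(train, S, k, rng=random.Random(0)):
    """Bagging: build k classifiers on bootstrap samples of S and combine them by voting.
    `train` is any function mapping a list of (x, y) samples to a predict(x) function."""
    classifiers = []
    for _ in range(k):
        T = rng.choices(S, k=len(S))        # bootstrap sample: draw |S| cases with replacement
        classifiers.append(train(T))
    def predict(x):
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]   # class with the most votes
    return predict

# Toy base learner (illustrative): always predict the majority class of its training sample.
def majority_class_learner(samples):
    majority = Counter(y for _, y in samples).most_common(1)[0][0]
    return lambda x: majority

S = [(i, "yes") for i in range(9)] + [(i, "no") for i in range(5)]
bagged = bagging(majority_class_learner, S, k=25)
print(bagged(0))    # majority vote over the 25 bootstrap classifiers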
Boosting Technique - Algorithm
- Assign every example an equal weight 1/N
- For t = 1, 2, …, T:
  - obtain a hypothesis (classifier) h(t) under the weights w(t)
  - calculate the error of h(t) and re-weight the examples based on the error; each classifier is dependent on the previous ones
  - samples that are incorrectly predicted are weighted more heavily
  - normalize w(t+1) to sum to 1
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set (the weights assigned to the different classifiers sum to 1)
- (An AdaBoost-style sketch of this loop follows at the end of this subsection.)

Bagging and Boosting
- "Experiments with a new boosting algorithm", Freund et al. (AdaBoost)
- "Bagging Predictors", Breiman
- "Boosting naïve Bayesian learning on a large subset of MEDLINE", W. Wilbur

Multiple Classifiers Combination
- Parallel combination
- Sequential combination
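The boosting loop above is, in spirit, what AdaBoost does. Below is a compact sketch under the usual binary-label (±1) convention; the names, the decision-stump base learner, and the toy data are illustrative, and the slides do not fix a particular boosting variant.

import math

def adaboost(train_weighted, X, y, T):
    """y[i] in {-1, +1}. `train_weighted(X, y, w)` returns a hypothesis h(x) -> {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                       # equal initial weights
    hypotheses = []                         # list of (alpha_t, h_t)
    for _ in range(T):
        h = train_weighted(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)       # accuracy-based classifier weight
        # misclassified samples get more weight, then weights are normalized to sum to 1
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
        hypotheses.append((alpha, h))
    def H(x):                               # weighted vote of all hypotheses
        return 1 if sum(a * h(x) for a, h in hypotheses) >= 0 else -1
    return H

# Toy usage: decision stumps of the form "x > t" on one-dimensional data.
def stump_learner(X, y, w):
    errs = ((sum(wi for wi, xi, yi in zip(w, X, y) if (1 if xi > t else -1) != yi), t)
            for t in X)
    _, t = min(errs, key=lambda p: p[0])
    return lambda x: 1 if x > t else -1

X = [1, 2, 3, 4, 5, 6]
y = [-1, -1, 1, 1, 1, -1]
print([adaboost(stump_learner, X, y, T=5)(x) for x in X])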
References
- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-168, 1998.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
- U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
- J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98), pages 416-427, New York, NY, August 1998.
- J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic decision tree construction. In Proc. SIGMOD'99, Philadelphia, Pennsylvania, 1999.
- M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop on Research Issues on Data Engineering (RIDE'97), Birmingham, England, April 1997.
- B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), New York, NY, August 1998.
- W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, November 2001.
- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
- S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4):345-389, 1998.
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
- J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, August 1996.
- R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98), pages 404-415, New York, NY, August 1998.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), pages 544-555, Bombay, India, September 1996.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.