Data Mining: Concepts and Techniques
Slides for Textbook, Chapter 7
© Jiawei Han and Micheline Kamber, Department of Computer Science, University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj

Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule mining
Other Classification Methods
Prediction
Classification accuracy
Summary

Bayesian Classification: Why?
Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities.
Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown.
Let H be the hypothesis that X belongs to class C.
For classification problems, we want P(H|X): the probability that the hypothesis holds given the observed data sample X.
P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects background knowledge).
P(X): probability that the sample data is observed.
P(X|H): probability of observing the sample X, given that the hypothesis holds.

Bayesian Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows from Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
Informally: posterior = likelihood x prior / evidence.
MAP (maximum a posteriori) hypothesis:
h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)
Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost.

Naïve Bayes Classifier
Simplifying assumption: attributes are conditionally independent given the class:
P(X|Ci) = ∏_{k=1}^{n} P(x_k | Ci)
The probability of observing, say, two attribute values y1 and y2 given the current class C is the product of the probabilities of each value taken separately given the same class: P([y1, y2] | C) = P(y1|C) * P(y2|C).
No dependence relation between attributes is modeled.
This greatly reduces the computation cost: we only count the class distributions.
Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci).

Example
H: X is an apple.
P(H): prior probability that X is an apple.
X: the observed data (round and red).
P(H|X): probability that X is an apple, given that we observe it is red and round.
P(X|H): probability that a sample is red and round, given that it is an apple.
P(X): prior probability that a sample is red and round.

Applying Bayes' Theorem
P(H|X) = P(H, X) / P(X), and similarly P(X|H) = P(H, X) / P(H).
So P(H, X) = P(X|H) P(H), and hence P(H|X) = P(X|H) P(H) / P(X).
This lets us calculate P(H|X) from P(X|H), P(H), and P(X).
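To make the arithmetic concrete, here is a minimal Python sketch of Bayes' theorem applied to the apple example above. The numeric values of P(H), P(X|H), and P(X) are not on the slides and are assumed purely for illustration.

```python
# Hypothetical numbers for the apple example; only the formula
# P(H|X) = P(X|H) * P(H) / P(X) comes from the slides.
p_h = 0.3          # P(H): prior probability that a fruit X is an apple (assumed)
p_x_given_h = 0.9  # P(X|H): probability X is red and round, given it is an apple (assumed)
p_x = 0.5          # P(X): prior probability that a fruit is red and round (assumed)

p_h_given_x = p_x_given_h * p_h / p_x  # Bayes' theorem
print(f"P(apple | red and round) = {p_h_given_x:.2f}")  # -> 0.54
```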
Bayesian Classification
The classification problem can be formalized using a-posteriori probabilities:
P(Ci|X) = probability that the sample tuple X = <x1, ..., xk> is of class Ci, where there are m classes Ci, i = 1, ..., m.
E.g., P(class = N | outlook = sunny, windy = true, ...).
Idea: assign to sample X the class label Ci for which P(Ci|X) is maximal, i.e., P(Ci|X) > P(Cj|X) for all 1 <= j <= m, j != i.

Estimating A-Posteriori Probabilities
Bayes' theorem: P(Ci|X) = P(X|Ci) · P(Ci) / P(X).
P(X) is constant for all classes.
P(Ci) = relative frequency of class Ci samples.
The Ci with maximum P(Ci|X) is the Ci with maximum P(X|Ci) · P(Ci).
Problem: computing P(X|Ci) directly is infeasible!

Naïve Bayesian Classification
Naïve assumption: attribute independence, so P(x1, ..., xk | Ci) = P(x1|Ci) · ... · P(xk|Ci).
If the i-th attribute is categorical: P(xi|Ci) is estimated as the relative frequency sik/si of samples of class Ci having value xi for the i-th attribute.
If the i-th attribute is continuous: P(xi|Ci) is estimated through a Gaussian density function.
Computationally easy in both cases.

Training Dataset
Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'.
Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair).

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no

Solution
Given the new customer X = (age <= 30, income = medium, student = yes, credit_rating = fair), what is the probability that the customer buys a computer?
Compute P(buys_computer = yes | X) and P(buys_computer = no | X).
Decision: either report the two probabilities or choose the class with the maximum conditional probability.

Compute P(buys_computer = yes | X) = P(X|yes) P(yes) / P(X) and P(buys_computer = no | X) = P(X|no) P(no) / P(X).
Drop P(X), as it is the common denominator.
Decision: take the maximum of P(X|yes) P(yes) and P(X|no) P(no).

Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class:
P(X|yes) P(yes) = P(age = "<=30" | yes) * P(income = "medium" | yes) * P(student = "yes" | yes) * P(credit_rating = "fair" | yes) * P(C = yes)
P(X|no) P(no) = P(age = "<=30" | no) * P(income = "medium" | no) * P(student = "yes" | no) * P(credit_rating = "fair" | no) * P(C = no)

P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357

P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.044 * 0.643 = 0.028
P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X | buys_computer = "no") * P(buys_computer = "no") = 0.019 * 0.357 = 0.007
X belongs to class buys_computer = "yes".
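The hand calculation above can be reproduced with a short Python sketch. This is a minimal illustration, not the textbook's code: it simply counts relative frequencies from the 14 training tuples in the table and scores the new customer under the naïve independence assumption.

```python
from collections import Counter, defaultdict

# Training tuples from the slide's table: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]

n = len(data)
class_counts = Counter(row[-1] for row in data)   # yes: 9, no: 5

cond_counts = defaultdict(int)                    # counts for P(x_k | Ci)
for row in data:
    for k, value in enumerate(row[:-1]):
        cond_counts[(k, value, row[-1])] += 1

def score(x, c):
    """P(X|Ci) * P(Ci) under the naive conditional-independence assumption."""
    p = class_counts[c] / n
    for k, value in enumerate(x):
        p *= cond_counts[(k, value, c)] / class_counts[c]
    return p

x = ("<=30", "medium", "yes", "fair")             # the new customer from the slides
scores = {c: score(x, c) for c in class_counts}
print(scores)                                     # 'yes': ~0.028, 'no': ~0.007
print("predicted class:", max(scores, key=scores.get))   # -> yes
```

The printed scores match the values computed above (about 0.028 for "yes" and 0.007 for "no"), so X is assigned to buys_computer = "yes".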
Class probabilities
P(yes|X) = P(X|yes) P(yes) / P(X) and P(no|X) = P(X|no) P(no) / P(X).
What is P(X)?
P(X) = P(X|yes) P(yes) + P(X|no) P(no) = 0.028 + 0.007 = 0.035.
So P(yes|X) = 0.028 / 0.035 = 0.80 and P(no|X) = 0.007 / 0.035 = 0.20.
Hence P(yes|X) + P(no|X) = 1.

Summary of Calculations
Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) P(Ci):
P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.044 * 0.643 = 0.028
P(X | buys_computer = "no") * P(buys_computer = "no") = 0.019 * 0.357 = 0.007
X belongs to class buys_computer = "yes".

Naïve Bayesian Classifier: Comments
Advantages: easy to implement; good results obtained in most cases.
Disadvantages: the assumption of class-conditional independence causes a loss of accuracy, because in practice dependencies exist among variables.
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by a naïve Bayesian classifier.
How to deal with these dependencies? Bayesian belief networks.

Bayesian Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent.
It is a graphical model of causal relationships: it represents dependencies among the variables and gives a specification of the joint probability distribution.
Nodes: random variables. Links: dependencies.
Example graph: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P. The graph has no loops or cycles.

Bayesian Belief Network: An Example
Nodes in the figure: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.
The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

The joint probability factorizes over the network:
P(z1, ..., zn) = ∏_{i=1}^{n} P(zi | Parents(Zi))
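As a rough illustration of how the factorized joint probability is used, the sketch below encodes the LungCancer CPT from the slide and multiplies each node's probability given its parents. The priors for FamilyHistory and Smoker are not given on the slide and are assumed here, and the remaining nodes (Emphysema, PositiveXRay, Dyspnea) are omitted for brevity.

```python
# The CPT for LungCancer (LC) is taken from the slide; the priors for
# FamilyHistory (FH) and Smoker (S) are NOT on the slide and are assumed
# purely for illustration.
p_fh = {True: 0.2, False: 0.8}   # assumed P(FamilyHistory)
p_s  = {True: 0.3, False: 0.7}   # assumed P(Smoker)

# P(LungCancer = yes | FH, S) from the slide's conditional probability table
p_lc = {
    (True, True):  0.8,   # (FH, S)
    (True, False): 0.5,   # (FH, ~S)
    (False, True): 0.7,   # (~FH, S)
    (False, False): 0.1,  # (~FH, ~S)
}

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S): each node given its parents."""
    p = p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * (p if lc else 1 - p)

# e.g. the probability of a smoker with family history who develops lung cancer
print(joint(True, True, True))   # 0.2 * 0.3 * 0.8 = 0.048
```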
Learning Bayesian Networks
Several cases:
Given both the network structure and all variables observable: learn only the CPTs.
Network structure known, some variables hidden: use gradient descent, analogous to neural network learning.
Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology.
Unknown structure, all variables hidden: no good algorithms are known for this purpose.
Reference: D. Heckerman, Bayesian networks for data mining.

Chapter 7. Classification and Prediction (outline slide repeated; next topic: other classification methods)

Other Classification Methods
k-nearest neighbor classifier
Case-based reasoning
Genetic algorithms
Rough set approach
Fuzzy set approaches

Instance-Based Methods
Instance-based learning: store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
Typical approaches:
k-nearest neighbor approach: instances represented as points in a Euclidean space.
Locally weighted regression: constructs a local approximation.
Case-based reasoning: uses symbolic representations and knowledge-based inference.

The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-dimensional space.
The nearest neighbors are defined in terms of Euclidean distance.
The target function can be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to the query point xq.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples (figure in the original slide).

For a discrete target with values V = {v1, ..., vn}:
f^(xq) = argmax_{v ∈ V} Σ_{i=1}^{k} δ(v, f(xi)), where δ(a, b) = 1 if a = b and 0 otherwise.
For real-valued target functions:
f^(xq) = (1/k) Σ_{i=1}^{k} f(xi)

Discussion on the k-NN Algorithm
For continuous-valued target functions, the k-NN algorithm returns the mean value of the k nearest neighbors.
Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors, e.g., w = 1 / d(xq, xi)^2. The same weighting applies to real-valued target functions.
Robust to noisy data by averaging over the k nearest neighbors.
Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.
Stretch each variable by a different factor, zi = ki * xi, and choose the best stretching factors by cross-validation experiments; irrelevant variables get small ki values, and ki = 0 completely eliminates the variable.
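The following Python sketch illustrates the three prediction rules above (majority vote for discrete targets, mean value for real-valued targets, and the distance-weighted vote). The 2-D training points and the small epsilon added to the squared distance are illustrative choices, not part of the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, xq, k=3):
    """Discrete target: majority vote among the k nearest neighbors of xq."""
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], xq))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_regress(train, xq, k=3):
    """Real-valued target: mean of the k nearest neighbors' values."""
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], xq))[:k]
    return sum(value for _, value in neighbors) / k

def knn_weighted(train, xq, k=3):
    """Distance-weighted vote: each neighbor counts with weight w = 1 / d(xq, xi)^2."""
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], xq))[:k]
    votes = Counter()
    for xi, label in neighbors:
        votes[label] += 1.0 / (euclidean(xi, xq) ** 2 + 1e-12)  # epsilon avoids division by zero
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points with class labels
train_cls = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"), ((5, 5), "-"), ((6, 5), "-")]
print(knn_classify(train_cls, (1.5, 1.5)))   # -> '+'
print(knn_weighted(train_cls, (5.5, 5.0)))   # -> '-'
```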
Chapter 7. Classification and Prediction (outline slide repeated; next topic: classification accuracy)

Classification Accuracy: Estimating Error Rates
Partition, training-and-testing (the holdout method): use two independent data sets, e.g., a training set (2/3 of the data) and a test set (1/3); suited to data sets with a large number of samples.
Cross-validation: divide the data set into k subsamples, use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation); suited to data sets of moderate size.
Bootstrapping and leave-one-out: suited to small data sets.

Cross-Validation
Divide the data set into k subsamples S1, S2, ..., Sk.
Use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation); suited to data sets of moderate size.
Stratified cross-validation: the folds are stratified so that the class distribution in each fold is approximately the same as in the initial data.
10-fold stratified cross-validation is the most common choice.

Leave-One-Out
For small data sets: k equals the sample size S.
In each round, one sample is used for testing and the remaining S-1 for training; repeat for each sample.

Bootstrapping
Sampling with replacement is used to form the training set: a data set of N instances is sampled N times with replacement, so some instances are repeated in the new data set.
Test instances: those in the original data set that were not picked for the new data set.
0.632 bootstrap: out of n cases, an object is not selected at any single drawing with probability 1 - 1/n; after n drawings, the probability of never selecting a case is (1 - 1/n)^n ≈ e^(-1) = 0.368.
So about 36.8% of the objects are not in the training sample; they constitute the test data.
error = 0.632 * e_test + 0.368 * e_training
The whole bootstrap is repeated several times and the error rate is averaged.
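A minimal sketch, under stated assumptions, of k-fold cross-validation and the 0.632 bootstrap combination described above. The `train_fn` argument and the majority-class toy learner are placeholders invented for the example; any classifier could be plugged in.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds; each fold is the test set once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_error(xs, ys, train_fn, k=10):
    """Average test-set error rate over the k folds; train_fn returns a predict(x) function."""
    errors = []
    for test in k_fold_indices(len(xs), k):
        test_set = set(test)
        train = [i for i in range(len(xs)) if i not in test_set]
        predict = train_fn([xs[i] for i in train], [ys[i] for i in train])
        wrong = sum(predict(xs[i]) != ys[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / k

def bootstrap_632(e_test, e_train):
    """Combine the error on the ~36.8% left-out cases with the training error, as on the slide."""
    return 0.632 * e_test + 0.368 * e_train

# Toy usage: a "classifier" that always predicts the majority class of its training data.
def majority_trainer(xs, ys):
    most_common = max(set(ys), key=ys.count)
    return lambda x: most_common

toy_x = list(range(20))
toy_y = ["yes"] * 14 + ["no"] * 6
print(cross_val_error(toy_x, toy_y, majority_trainer, k=5))   # ~0.3 for this toy data
print(bootstrap_632(e_test=0.35, e_train=0.10))               # hypothetical error rates
```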
Bagging and Boosting: General Idea
(Diagram in the original slide: the training data is repeatedly altered and fed to the same classification method (CM), producing classifiers C1, C2, ..., which are then aggregated into a combined classifier C*.)

Bagging
Given a set S of s samples, generate a bootstrap sample T from S; cases in S may not appear in T or may appear more than once.
Repeat this sampling procedure to get a sequence of k independent training sets.
A corresponding sequence of classifiers C1, C2, ..., Ck is constructed for these training sets using the same classification algorithm.
To classify an unknown sample X, let each classifier predict (vote).
The bagged classifier C* counts the votes and assigns X to the class with the most votes.

Increasing Classifier Accuracy: Bagging (WF pp. 250-253)
Take t samples from the data set S, sampling with replacement, so each case may occur more than once.
A new classifier Ct is learned for each sample set St.
Form the bagged classifier C* by letting the classifiers vote equally (majority rule).
For continuous-valued prediction problems, take the average of the predictors.

Increasing Classifier Accuracy: Boosting (WF pp. 254-258)
Boosting increases classification accuracy.
Applicable to decision trees, Bayesian classifiers, and other learners.
Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor.
Boosting requires only linear time and constant space.

Boosting Technique: Algorithm
Assign every example an equal weight 1/N.
For t = 1, 2, ..., T:
  Obtain a hypothesis (classifier) h(t) under the weights w(t).
  Calculate the error of h(t) and re-weight the examples based on the error; each classifier depends on the previous ones, and samples that are incorrectly predicted are weighted more heavily.
  Normalize w(t+1) to sum to 1.
Output a weighted vote of all the hypotheses, each hypothesis weighted according to its accuracy on the training set.

Boosting Technique (II): Algorithm
Assign every example an equal weight 1/N.
For t = 1, 2, ..., T:
  Obtain a hypothesis (classifier) h(t) under the weights w(t).
  Calculate the error e of h(t) (the current classifier's overall error) and re-weight the examples: decrease the weights of correctly classified cases, weight(t+1) = weight(t) * e / (1 - e); weights of incorrectly classified cases are unchanged.
  Normalize w(t+1) to sum to 1.
  If e > 0.5, delete the current classifier and stop; if e = 0, stop.
Output a weighted vote of all the hypotheses, each hypothesis weighted by -log(e / (1 - e)), i.e. according to its accuracy on the training set.

Bagging and Boosting: References
Y. Freund and R. E. Schapire, Experiments with a New Boosting Algorithm (AdaBoost).
L. Breiman, Bagging Predictors.
W. Wilbur, Boosting naïve Bayesian learning on a large subset of MEDLINE.

Is Accuracy Enough to Judge a Classifier?
Speed.
Robustness (accuracy on noisy data).
Scalability (e.g., the number of I/O operations for large data sets).
Interpretability: subjective, or measured objectively, e.g., by the number of hidden units in an ANN or the number of nodes in a decision tree.

Alternatives to the Accuracy Measure
Example: cancer vs. not_cancer, where cancer samples are positive and not_cancer samples are negative.
sensitivity = t_pos / pos
specificity = t_neg / neg
precision = t_pos / (t_pos + f_pos)
t_pos: number of true positives (cancer samples correctly classified as such).
pos: number of positive (cancer) samples.
t_neg: number of true negatives (not_cancer samples correctly classified as such).
neg: number of negative (not_cancer) samples.
f_pos: number of false positives (not_cancer samples incorrectly labelled as cancer).
Accuracy is a function of sensitivity and specificity:
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

Confusion matrix:

                 actual pos   actual neg
predicted pos    t_pos        f_pos
predicted neg    f_neg        t_neg
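As a small worked example of the measures just defined, the sketch below computes sensitivity, specificity, precision, and the weighted accuracy from confusion-matrix counts. The counts themselves are hypothetical; the formulas are the ones on the slides.

```python
# Counts for a hypothetical cancer / not_cancer test set (assumed for illustration).
t_pos, f_neg = 90, 10     # of 100 actual cancer (positive) samples
t_neg, f_pos = 880, 20    # of 900 actual not_cancer (negative) samples

pos = t_pos + f_neg       # number of positive samples
neg = t_neg + f_pos       # number of negative samples

sensitivity = t_pos / pos                 # true positive rate
specificity = t_neg / neg                 # true negative rate
precision   = t_pos / (t_pos + f_pos)
accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

print(sensitivity, specificity, precision, accuracy)   # 0.9  0.978  0.818  0.97
```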
Evaluating Numeric Prediction
The same strategies apply: independent test set, cross-validation, significance tests, etc.
The difference lies in the error measures.
Actual target values: a1, a2, ..., an. Predicted target values: p1, p2, ..., pn.
The most popular measure is the mean squared error, which is easy to manipulate mathematically:
MSE = ((p1 - a1)^2 + ... + (pn - an)^2) / n

Other Measures
The root mean squared error:
RMSE = sqrt(((p1 - a1)^2 + ... + (pn - an)^2) / n)
The mean absolute error is less sensitive to outliers than the mean squared error:
MAE = (|p1 - a1| + ... + |pn - an|) / n
Sometimes relative error values are more appropriate (e.g., 10% for an error of 50 when predicting 500).

Lift Charts
In practice, costs are rarely known, so decisions are usually made by comparing possible scenarios.
Example: a promotional mailout to 1,000,000 households.
Mail to all: 0.1% respond (1000).
A data mining tool identifies the subset of 100,000 most promising households, of which 0.4% respond (400): 40% of the responses for 10% of the cost, which may pay off.
Alternatively, identify the subset of 400,000 most promising households, of which 0.2% respond (800).
A lift chart allows a visual comparison of such scenarios.

Generating a Lift Chart
Sort the instances according to their predicted probability of being positive:

Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
...    ...                     ...

The x axis is the sample size; the y axis is the number of true positives.

A Hypothetical Lift Chart
(Chart in the original slide: 40% of the responses for 10% of the cost; 80% of the responses for 40% of the cost.)
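To tie the last few slides together, this sketch computes the three error measures for a handful of hypothetical predictions and derives the lift-chart points for the four instances listed in the "Generating a lift chart" slide. The numeric target values are assumed purely for illustration.

```python
import math

# Hypothetical actual and predicted numeric target values (assumed for illustration).
actual    = [500, 200, 90, 40]
predicted = [450, 210, 80, 50]
n = len(actual)

mse  = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n
rmse = math.sqrt(mse)
mae  = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
print(mse, rmse, mae)            # 700.0  26.46...  20.0

# Lift-chart points: sort instances by predicted probability of being positive and
# accumulate true positives (y) against sample size (x), as described on the slides.
scored = [(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes")]
scored.sort(key=lambda pair: pair[0], reverse=True)

points, true_pos = [], 0
for size, (prob, label) in enumerate(scored, start=1):
    true_pos += (label == "Yes")
    points.append((size, true_pos))
print(points)                    # [(1, 1), (2, 2), (3, 2), (4, 3)]
```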
Chapter 7. Classification and Prediction (outline slide repeated; next topic: summary)

Summary
Classification is an extensively studied problem, mainly in statistics, machine learning, and neural networks.
Classification is probably one of the most widely used data mining techniques, with many extensions.
Scalability is still an important issue for database applications; combining classification with database techniques should therefore be a promising research topic.
Research directions: classification of non-relational data, e.g., text, spatial, and multimedia data.

References (1)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-168, 1998.
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic decision tree construction. In SIGMOD'99, Philadelphia, Pennsylvania, 1999.

References (2)
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop on Research Issues on Data Engineering (RIDE'97), Birmingham, England, April 1997.
B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), New York, NY, August 1998.
W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, November 2001.
J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.

References (3)
T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4):345-389, 1998.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, August 1996.
R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, September 1996.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.

www.cs.uiuc.edu/~hanj
Thank you!