Classification and Prediction
Lecture 5/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA ([email protected])
Faculty of Computer Science, University of Indonesia

Objectives
* Introduction
* What is Classification?
* Classification vs Prediction
* Supervised and Unsupervised Learning
* Data Preparation
* Classification Accuracy
* ID3 Algorithm
* Information Gain
* Bayesian Classification
* Predictive Modelling

Introduction
* Databases are rich with hidden information that can be used for making intelligent business decisions.
* Classification and prediction can be used to extract models describing important data classes or to predict future data trends.
* Classification predicts categorical labels. Ex: categorize bank loan applications → safe or risky.
* Prediction models continuous-valued functions. Ex: predict the expenditures of potential customers on computer equipment given their income and occupation.
* Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.

What is Classification? – A Two-Step Process (1)
* Model construction:
  - Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
  - Data tuples are also referred to as samples, examples, or objects.
  - All tuples used for model construction form the training set.
  - Since the class label of each training sample is provided → supervised learning. In clustering (unsupervised learning), the class label of each training sample is not known, and the number or set of classes to be learned may not be known in advance.
  - The model is represented as classification rules (IF-THEN statements), decision trees, or mathematical formulae.

What is Classification? – A Two-Step Process (2)
* The model is used for classifying future or unknown objects.
* First, the predictive accuracy of the model is estimated:
  - The known label of each test sample is compared with the classified result from the model.
  - The accuracy rate is the percentage of test set samples that are correctly classified by the model.
  - The test set must be independent of the training set, otherwise over-fitting will occur (the model may have incorporated particular anomalies of the training data that are not present in the overall sample population).
* If the accuracy of the model is considered acceptable, the model can be used to classify future objects for which the class label is not known (unknown, previously unseen data).

Classification Process (1): Model Construction
* Training data:

  NAME   RANK             YEARS   TENURED
  Mike   Assistant Prof     3     no
  Mary   Assistant Prof     7     yes
  Bill   Professor          2     yes
  Jim    Associate Prof     7     yes
  Dave   Assistant Prof     6     no
  Anne   Associate Prof     3     no

* The classification algorithm produces a classifier (model), e.g.:
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model
* Testing data:

  NAME      RANK             YEARS   TENURED
  Tom       Assistant Prof     2     no
  Merlisa   Associate Prof     7     no
  George    Professor          5     yes
  Joseph    Assistant Prof     7     yes

* Unseen data: (Jeff, Professor, 4) → tenured?
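As a concrete illustration of the two-step process, here is a minimal sketch (Python; not part of the original slides) that encodes the learned rule above as a function, estimates its accuracy on the independent test set, and then classifies the unseen tuple (Jeff, Professor, 4):

```python
# A minimal sketch of the two-step classification process, using the learned
# rule above: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def classify(rank: str, years: int) -> str:
    """The model produced in step 1 (model construction)."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2a: estimate predictive accuracy on the independent test set.
test_set = [  # (name, rank, years, actual tenured label)
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
correct = sum(classify(rank, years) == label for _, rank, years, label in test_set)
print(f"accuracy = {correct / len(test_set):.0%}")  # 75%: Merlisa is misclassified

# Step 2b: if the accuracy is acceptable, classify previously unseen data.
print("Jeff ->", classify("Professor", 4))  # yes
```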
What is Prediction?
* Prediction is similar to classification:
  - First, construct a model.
  - Second, use the model to predict future or unknown objects.
* The major method for prediction is regression:
  - Linear and multiple regression
  - Non-linear regression
* Prediction is different from classification:
  - Classification predicts a categorical class label.
  - Prediction predicts a continuous value.

Classification vs Prediction
* Sending out promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer → classification.
* Predicting the number of major purchases that a customer will make during a fiscal year → prediction.

Supervised vs Unsupervised Learning
* Supervised learning (classification):
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  - New data are classified based on the training set.
* Unsupervised learning (clustering):
  - We are given a set of measurements, observations, etc., with the aim of establishing the existence of classes or clusters in the data.
  - There are no training data, or the "training data" are not accompanied by class labels.

Issues – Data Preparation
* Data preprocessing can be used to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
* Data cleaning:
  - Remove or reduce noise and treat missing values.
* Relevance analysis:
  - Many of the attributes in the data may be irrelevant to the classification or prediction task. Ex: data recording the day of the week on which a bank loan application was filed are unlikely to be relevant to the success of the application.
  - Other attributes may be redundant.
  - This step is known as feature selection.
* Data transformation:
  - Data can be generalized to higher-level concepts; this is useful for continuous-valued attributes. Ex: income can be generalized to low, medium, high; street to city.
  - Generalization compresses the original training data, so fewer input/output operations may be involved during learning.
  - When using neural networks (or other methods involving distance measurements), data may also be normalized.

Comparing Classification Methods
* Predictive accuracy.
* Speed and scalability:
  - Time to construct the model; time to use the model.
  - Scalability: efficiency on large databases (data not memory resident).
* Robustness: handling noise and missing values.
* Interpretability: the level of understanding and insight provided by the model.
* Goodness of rules: decision tree size; the compactness of classification rules.

Classification Accuracy: Estimating Error Rates
* Partition: training-and-testing
  - Use two independent data sets, e.g., training set (2/3) and test set (1/3).
  - Used for data sets with a large number of samples.
* Cross-validation (see the sketch after this list)
  - Divide the data set into k subsamples.
  - Use k−1 subsamples as training data and one subsample as test data: k-fold cross-validation.
  - For data sets of moderate size.
* Bootstrapping (leave-one-out)
  - For small data sets.
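The k-fold cross-validation procedure described above can be sketched in a few lines of Python. Here `train` and `evaluate` are hypothetical placeholders for any classifier's fitting and scoring routines; they are not part of the original material.

```python
# A minimal sketch of k-fold cross-validation: split the data into k
# subsamples, train on k-1 of them, test on the remaining one, and average
# the k accuracy estimates.
import random

def k_fold_accuracy(samples, k, train, evaluate):
    samples = list(samples)
    random.shuffle(samples)
    folds = [samples[i::k] for i in range(k)]      # k roughly equal subsamples
    scores = []
    for i in range(k):
        test_fold = folds[i]                       # one subsample as test data
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train(training)                    # k-1 subsamples as training data
        scores.append(evaluate(model, test_fold))
    return sum(scores) / k                         # average accuracy over k runs
```

Setting k equal to the number of samples gives the leave-one-out scheme mentioned above for small data sets.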
What is a Decision Tree?
* A decision tree is a flowchart-like tree structure:
  - An internal node denotes a test on an attribute.
  - A branch represents an outcome of the test; all tuples in a branch have the same value for the tested attribute.
  - A leaf node represents a class label or a class label distribution.
* To classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node that holds the class prediction for that sample.
* Decision trees can easily be converted to classification rules.

Training Dataset (an example from Quinlan's ID3)

  Outlook    Temperature  Humidity  Windy  Class
  sunny      hot          high      false  N
  sunny      hot          high      true   N
  overcast   hot          high      false  P
  rain       mild         high      false  P
  rain       cool         normal    false  P
  rain       cool         normal    true   N
  overcast   cool         normal    true   P
  sunny      mild         high      false  N
  sunny      cool         normal    false  P
  rain       mild         normal    false  P
  sunny      mild         normal    true   P
  overcast   mild         high      true   P
  overcast   hot          normal    false  P
  rain       mild         high      true   N

A Sample Decision Tree
* Outlook = sunny → test humidity: high → N; normal → P
* Outlook = overcast → P
* Outlook = rain → test windy: true → N; false → P

Decision Tree Classification Methods
* The basic top-down decision tree generation approach usually consists of two phases:
  - Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes.
  - Tree pruning: aims at removing tree branches that may lead to errors when classifying test data (the training data may contain noise, outliers, ...).

ID3 Algorithm
* Assume all attributes are categorical (discrete-valued); continuous-valued attributes must be discretized.
* Create a node N.
* If the samples are all of the same class C, return N as a leaf node labeled with C.
* If the attribute list is empty, return N as a leaf node labeled with the most common class.
* Otherwise, select the split-attribute with the highest information gain and label N with it.
* For each value Ai of the split-attribute, grow a branch from node N, and let Si be the branch in which all tuples have the value Ai for the split-attribute:
  - If Si is empty, attach a leaf labeled with the most common class.
  - Else, recursively run the algorithm at node Si.
* Repeat until all branches reach leaf nodes.

Choosing the Split Attribute – Information Gain (ID3/C4.5) (1)
* Information gain is used to select the test attribute at each node in the tree; it is also called a measure of the goodness of split.
* The attribute with the highest information gain is chosen as the test attribute for the current node.

Information Gain (ID3/C4.5) (2)
* Assume that there are two classes, P and N, and let the set of examples S contain p elements of class P and n elements of class N.
* The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

  I(p, n) = -(p / (p + n)) log2 (p / (p + n)) - (n / (p + n)) log2 (n / (p + n))

* Assume that using attribute A as the root of the tree will partition S into sets {S1, S2, ..., Sv}. If Si contains pi examples of P and ni examples of N, the expected information needed to classify objects in all subtrees Si is

  E(A) = Σ (i = 1..v) ((pi + ni) / (p + n)) · I(pi, ni)

Information Gain (ID3/C4.5) (3)
* The attribute A is selected such that the information gain

  gain(A) = I(p, n) − E(A)

  is maximal, that is, E(A) is minimal, since I(p, n) is the same for all attributes at a node.
* In the given sample data, attribute outlook is chosen to split at the root:
  - gain(outlook) = 0.246
  - gain(temperature) = 0.029
  - gain(humidity) = 0.151
  - gain(windy) = 0.048
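The following sketch (Python; not part of the original slides) implements the information-gain computation and the recursive ID3 construction described above. On the weather training data it reproduces the gains stated on this slide and rebuilds the sample decision tree: outlook at the root, humidity under "sunny", and windy under "rain".

```python
# A minimal ID3 sketch on Quinlan's weather data above.
from collections import Counter
from math import log2

# (outlook, temperature, humidity, windy, class)
DATA = [
    ("sunny","hot","high","false","N"),    ("sunny","hot","high","true","N"),
    ("overcast","hot","high","false","P"), ("rain","mild","high","false","P"),
    ("rain","cool","normal","false","P"),  ("rain","cool","normal","true","N"),
    ("overcast","cool","normal","true","P"), ("sunny","mild","high","false","N"),
    ("sunny","cool","normal","false","P"), ("rain","mild","normal","false","P"),
    ("sunny","mild","normal","true","P"),  ("overcast","mild","high","true","P"),
    ("overcast","hot","normal","false","P"), ("rain","mild","high","true","N"),
]
ATTRS = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def info(samples):
    """I(p, n): expected information to classify a sample in this set."""
    counts = Counter(s[-1] for s in samples)
    return -sum(c / len(samples) * log2(c / len(samples)) for c in counts.values())

def gain(samples, attr):
    """gain(A) = I(p, n) - E(A): information gained by splitting on attr."""
    idx = ATTRS[attr]
    subsets = [[s for s in samples if s[idx] == v] for v in {s[idx] for s in samples}]
    e_a = sum(len(sub) / len(samples) * info(sub) for sub in subsets)
    return info(samples) - e_a

def id3(samples, attrs):
    """Recursive top-down construction, following the pseudocode above."""
    classes = {s[-1] for s in samples}
    if len(classes) == 1:                       # all samples in the same class
        return classes.pop()
    if not attrs:                               # attribute list is empty
        return Counter(s[-1] for s in samples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(samples, a))   # highest information gain
    idx = ATTRS[best]
    return {best: {v: id3([s for s in samples if s[idx] == v], attrs - {best})
                   for v in {s[idx] for s in samples}}}

for a in ATTRS:
    print(f"gain({a}) = {gain(DATA, a):.3f}")   # outlook: 0.246 (the highest)
print(id3(DATA, set(ATTRS)))                    # matches the sample tree above
```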
Information Gain (ID3/C4.5) (4)
* A second example, from Table 7.1. The class label is buys_computer, with two values (yes, no), so m = 2. Let C1 correspond to yes and C2 to no; there are 9 samples of class yes and 5 samples of class no.
* Compute the expected information needed to classify a given sample:

  I(s1, s2) = I(9, 5) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940

* Next, compute the entropy of each attribute, starting with age:
  - For age = "<=30": s11 = 2, s21 = 3, I(s11, s21) = 0.971
  - For age = "31..40": s12 = 4, s22 = 0, I(s12, s22) = 0
  - For age = ">40": s13 = 3, s23 = 2, I(s13, s23) = 0.971
* Using equation (7.2), the expected information needed to classify a given sample if the samples are partitioned by age is

  E(age) = (5/14) I(s11, s21) + (4/14) I(s12, s22) + (5/14) I(s13, s23) = 0.694

* Hence, the gain in information from such a partitioning is

  Gain(age) = I(s1, s2) − E(age) = 0.246

* Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.

How to Use a Tree
* Directly:
  - Test the attribute values of the unknown sample against the tree.
  - A path is traced from the root to a leaf, which holds the class label.
* Indirectly:
  - The decision tree is converted to classification rules; one rule is created for each path from the root to a leaf (see the sketch after the pruning notes below).
  - IF-THEN rules are easier for humans to understand.
  - Example: IF age = "<=30" AND student = "no" THEN buys_computer = "no".

Tree Pruning
* A decision tree constructed from the training data may have too many branches and leaf nodes:
  - Caused by noise and overfitting.
  - May result in poor accuracy on unseen samples.
* Prune the tree: merge a subtree into a leaf node.
  - Use a data set different from the training data.
  - At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it with the majority class.
* Pruning criteria:
  - Pessimistic pruning: C4.5
  - MDL: SLIQ and SPRINT
  - Cost-complexity pruning: CART
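Below is a minimal sketch of the indirect use of a tree described above: converting a decision tree into one IF-THEN rule per root-to-leaf path. The nested-dict tree format mirrors the output of the ID3 sketch earlier; the helper `to_rules` is illustrative, not from the slides.

```python
# Convert a decision tree to IF-THEN classification rules, one rule per
# root-to-leaf path. Tree format: {attribute: {value: subtree-or-class-label}}.
TREE = {"outlook": {
    "sunny": {"humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain": {"windy": {"true": "N", "false": "P"}},
}}

def to_rules(tree, conditions=()):
    if not isinstance(tree, dict):                 # leaf node: emit one rule
        lhs = " AND ".join(f"{a} = '{v}'" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN class = '{tree}'"]
    (attr, branches), = tree.items()               # the attribute tested here
    rules = []
    for value, subtree in branches.items():        # one branch per outcome
        rules += to_rules(subtree, conditions + ((attr, value),))
    return rules

for rule in to_rules(TREE):
    print(rule)
# e.g. IF outlook = 'sunny' AND humidity = 'high' THEN class = 'N'
```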
Classification and Databases
* Classification is a classical problem extensively studied by statisticians and by AI (especially machine learning) researchers.
* Most previous studies used small data sets, and most algorithms are memory resident.
* Database researchers have re-examined the problem in the context of large databases.
* Recent data mining research contributes to scalability, generalization-based classification, and parallel and distributed processing.

Classifying Large Datasets
* Goal: classify data sets with millions of examples and a few hundred, even thousands, of attributes at reasonable speed.
* Decision trees seem to be a good choice:
  - Relatively faster learning speed than other classification methods.
  - Can be converted into simple and easy-to-understand classification rules.
  - Can be used to generate SQL queries for accessing databases.
  - Comparable classification accuracy with other methods.

Scalable Decision Tree Methods
* Most algorithms assume the data can fit in memory; data mining research contributes to the scalability issue, especially for decision trees.
* Successful examples:
  - SLIQ (EDBT'96, Mehta et al. '96)
  - SPRINT (VLDB'96, J. Shafer et al. '96)
  - PUBLIC (VLDB'98, Rastogi & Shim '98)
  - RainForest (VLDB'98, Gehrke et al. '98)

Previous Efforts on Scalability
* Incremental tree construction (Quinlan '86):
  - Uses partial data to build a tree.
  - Tests the other examples; the misclassified ones are used to rebuild the tree interactively.
* Data reduction (Catlett '91):
  - Reduces data size by sampling and discretization.
  - Still a main-memory algorithm.
* Data partition and merge (Chan and Stolfo '91):
  - Partitions the data and builds a tree for each partition.
  - Merges the multiple trees into a combined tree.
  - Experimental results indicated reduced classification accuracy.

Presentation of Classification Rules

Other Classification Methods
* Bayesian classification
* Neural networks
* Genetic algorithms
* Rough set approach
* k-nearest neighbor classifier
* Case-based reasoning (CBR)
* Fuzzy logic
* Support vector machines (SVM)

Bayesian Classification
* Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.
* Bayesian classification is based on Bayes' theorem.
* The naive Bayesian classifier is comparable in performance with decision tree and neural network classifiers.
* Bayesian classifiers also have high accuracy and speed when applied to large databases.
* See Example 7.4 for an example of naive Bayesian classification.

Bayes Theorem (1)
* Let X be a data sample whose class label is unknown, and let H be some hypothesis, such as "the data sample X belongs to a specified class C". We want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X.
* P(H|X) is the posterior probability (a posteriori probability) of H conditioned on X.
* Example: suppose the world of data samples consists of fruits, described by their color and shape. Suppose that X is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round.
* P(H) is the prior probability (a priori probability) of H: here, the probability that any given data sample is an apple, regardless of how the data sample looks.
* The posterior probability is based on more information (such as background knowledge) than the prior probability, which is independent of X.

Bayes Theorem (2)
* Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

Predictive Modeling in Databases
* What if we would like to predict a continuous value, rather than a categorical label? Prediction of continuous values can be modeled by the statistical techniques of regression.
* Examples:
  - A model to predict the salary of college graduates with 10 years of work experience.
  - Potential sales of a new product given its price.
* Many problems can be solved by linear regression. Software packages for solving regression problems include SAS, SPSS, and S-Plus.

Linear Regression
* Data are modeled using a straight line; this is the simplest form of regression.
* Bivariate linear regression models a random variable Y (called the response variable) as a linear function of another random variable X (called the predictor variable):

  Y = α + β X

* See Example 7.6 for an example of linear regression.
* Other regression models: multiple regression; log-linear models.
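As a sketch of how the coefficients α and β can be estimated, the code below fits Y = α + βX by one standard approach, the method of least squares (Python; not from the slides). The (years of experience, salary) points are illustrative stand-ins, not a reproduction of Example 7.6.

```python
# A minimal least-squares fit of the bivariate model Y = alpha + beta * X.
def fit_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # beta = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    alpha = y_bar - beta * x_bar
    return alpha, beta

years  = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]           # illustrative data
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]     # in $1000s

alpha, beta = fit_line(years, salary)
print(f"Y = {alpha:.1f} + {beta:.1f} X")
print(f"predicted salary at 10 years: {alpha + beta * 10:.1f}")
```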
Prediction: Numerical Data

Prediction: Categorical Data

Conclusion
* Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks).
* Classification is probably one of the most widely used data mining techniques, with a lot of applications.
* Scalability is still an important issue for database applications.
* Combining classification with database techniques should be a promising research topic.
* Research direction: classification of non-relational data, e.g., text, spatial, and multimedia data.

References
* C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
* L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
* P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
* U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
* J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
* M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.
* J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
* M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
* S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4):345-389, 1998.
* J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, August 1996.
* R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.
* J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, September 1996.
* S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.