SATOMGI Data Mining and Matching
Lecture 5: Introduction to classification and prediction

Lecture outline
• Classification and prediction
• Classification – A two step process
• Supervised versus un-supervised learning
• Evaluating classification methods
• Decision tree induction
• Overfitting
• Enhancements to basic decision tree induction
• Classification in large databases
• Other classification techniques
• Measuring classifier accuracy
• Evaluating the accuracy of a classifier
• What is prediction?

Classification and prediction
• Classification
  - Predicts categorical class labels (discrete or nominal)
  - Constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction
  - Models continuous-valued functions (predicts unknown or missing values)
• Typical applications: credit approval, target marketing, medical diagnosis, fraud detection, ...

Classification – A two step process
• Model construction: describing a set of predetermined classes
  - Each tuple/sample/record is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulas
• Model usage: classifying future or unknown objects
  - The accuracy of the model needs to be estimated: the known label of each test sample is compared with the result produced by the model
  - The accuracy rate is the percentage of test set samples that are correctly classified by the model
  - The test set must be independent of the training set, otherwise overfitting will occur
  - If the accuracy is acceptable, use the model to classify data objects whose class labels are not known

Example: Model construction
The training data is fed into a classification algorithm, which produces a classifier (model):

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

Classifier (model): IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'
Source: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.)

Example: Using the model in classification
The classifier is first evaluated on independent testing data (accuracy: 75%, since the rule misclassifies only Merlisa), then applied to unseen data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → tenured? The model predicts 'yes'.

Supervised versus un-supervised learning
• Supervised learning (classification and prediction)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
• Un-supervised learning (clustering and association rules mining)
  - The class label of the training data is unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of clusters or associations in the data

Evaluating classification methods
• Accuracy (or other quality measures)
  - Classifier accuracy: predicting class labels
  - Predictor accuracy: guessing values of predicted attributes
• Speed and complexity
  - Time to construct the model (training time)
  - Time to use the model (classification/prediction time)
• Scalability: efficiency in handling disk-based databases
• Robustness: handling noise and outliers
• Interpretability: understanding and insight provided by the model
• Other measures, for example the goodness of rules
  - Such as decision tree size or compactness of classification rules

Decision tree induction: Example training data set
What rule would you learn to identify who buys a computer?
  Age     Income  Student  Credit rating  Buys computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31..40  high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31..40  low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31..40  medium  no       excellent      yes
  31..40  high    yes      fair           yes
  >40     medium  no       excellent      no

Algorithm for decision tree induction
• Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down recursive divide-and-conquer manner
  - At the start, all training examples are at the root
  - Take the best immediate (local) decision while building the tree
  - Data is partitioned recursively based on selected attributes
  - Attributes are categorical (continuous-valued attributes need to be discretised in advance)
  - Attributes are selected on the basis of a heuristic or statistical measure
• Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  - There are no samples left

Example output: A decision tree for "Buys computer"

  age?
  ├─ <=30   → student?  (no → no;  yes → yes)
  ├─ 31..40 → yes
  └─ >40    → credit rating?  (excellent → no;  fair → yes)
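The slides say attributes are selected "on the basis of a heuristic or statistical measure" without naming one; a common choice in classic decision tree induction is information gain, the reduction in entropy achieved by a split. The following is an illustrative sketch (not from the slides) in plain Python: applied to the 14 "Buys computer" tuples above, it ranks age as the best root split, which matches the example tree.

```python
import math
from collections import Counter

# The 14 training tuples from the "Buys computer" example:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def entropy(rows):
    """Entropy (in bits) of the class label distribution in rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr_index):
    """Entropy reduction from partitioning rows on the given attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row for row in rows if row[attr_index] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

gains = {a: info_gain(data, i) for i, a in enumerate(attributes)}
best = max(gains, key=gains.get)
print(best)  # age (gain ~0.247, the largest of the four attributes)
```

A greedy tree builder would split the root on age, then recurse on each partition with the remaining attributes, stopping under the conditions listed above.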
Overfitting and tree pruning
• Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Classifies training records very well, but has poor accuracy on unseen records
• Two approaches to avoid overfitting
  - Prepruning: halt tree construction early: do not split a node if this would result in a goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees (use a set of data different from the training data to decide which is the "best pruned tree")

Enhancements to basic decision tree induction
• Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
• Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication

Classification in large databases
• Classification: a classical problem extensively studied by statisticians and machine learning researchers
• Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
• Why decision tree induction in data mining?
  - Relatively fast learning speed (compared to other classification methods)
  - Convertible to simple and easy to understand classification rules
  - Can use SQL queries for accessing databases
  - Comparable classification accuracy with other methods

Other classification methods
• Bayesian classification (a statistical classifier based on probabilities; assigns new cases to the most likely class)
• Rule based classification (rules can be generated from a decision tree, or directly from data)
  - For example: if age = "31..40" then "buys computer" = "yes"
• Support vector machines (transform data into a high-dimensional space to make classes linearly separable)
• Artificial neural networks (use simple neurons and learn the weights of connections between neurons)
• k-nearest-neighbour classifier (classifies a new record according to the class of its k nearest neighbours)

Measuring classifier accuracy
• Classification of new tuples/records results in a confusion matrix with positive and negative examples:

                Predicted class
  True class    C1               C2
  C1            True positive    False negative
  C2            False positive   True negative

• The accuracy of a classifier M, acc(M), is the percentage of test set tuples that are correctly classified by the model M
  - accuracy = (true_pos + true_neg) / (all_pos + all_neg)
• Error rate (misclassification rate) of M = 1 – acc(M)

Classifier accuracy measures

  classes              buy_computer = yes  buy_computer = no  total   recognition (%)
  buy_computer = yes   6954                46                 7000    99.34
  buy_computer = no    412                 2588               3000    86.27
  total                7366                2634               10000   95.42

• Accuracy in the example here is: (6954 + 2588) / 10000 = 95.4%
• Ideally, non-diagonal entries should be zero or close to it
• Alternative accuracy measures:
  - sensitivity = true_pos / all_pos (true positive recognition rate; also called recall)
  - specificity = true_neg / all_neg (true negative recognition rate)
  - precision = true_pos / (true_pos + false_pos) (also called positive predictive value)
• This model can also be used for
cost-benefit analysis

What is prediction?
• (Numerical) prediction is similar to classification
  - Construct a model
  - Use the model to predict a continuous or ordered value for a given input
• Prediction is different from classification
  - Classification predicts categorical class labels
  - Prediction models continuous-valued functions
• Major method for prediction: regression
  - Model the relationship between one or more independent (predictor) variables and a dependent (response) variable
• Regression analysis
  - Linear and multiple regression
  - Non-linear regression
  - Other regression methods: generalised linear model, Poisson regression, log-linear models, regression trees

Evaluating the accuracy of a classifier
• Holdout method
  - The given data is randomly partitioned into two independent sets: a training set (for example, 2/3) for model construction and a test set (for example, 1/3) for accuracy estimation
  - Random sampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
  - Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  - At the i-th iteration, use Di as the test set and the others as the training set
  - Leave-one-out: k folds where k = number of tuples, for small sized data
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data

Lecture summary
• Classification and prediction are supervised learning techniques (training data with known class labels is needed)
• Many different techniques are available
• Various measures exist to assess the accuracy of a classifier
• For data mining, classification techniques have to be scalable with the size of the data as well as its dimensionality (number of attributes)
• Prediction is used for continuous values, classification for categorical values
  - Predict salary values based on past salary and other attributes
  - Classify whether a patient should get a cancer test or not
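The classifier accuracy measures defined in this lecture can be checked against the worked "buy_computer" confusion matrix (6954 true positives, 46 false negatives, 412 false positives, 2588 true negatives); a minimal sketch in plain Python:

```python
# Counts from the "buy_computer" confusion matrix in the lecture.
true_pos, false_neg = 6954, 46   # actual "yes" tuples
false_pos, true_neg = 412, 2588  # actual "no" tuples

all_pos = true_pos + false_neg   # 7000 actual positives
all_neg = false_pos + true_neg   # 3000 actual negatives

accuracy    = (true_pos + true_neg) / (all_pos + all_neg)
error_rate  = 1 - accuracy
sensitivity = true_pos / all_pos                 # recall: true positive rate
specificity = true_neg / all_neg                 # true negative rate
precision   = true_pos / (true_pos + false_pos)  # positive predictive value

print(f"accuracy={accuracy:.4f} error={error_rate:.4f}")
print(f"sensitivity={sensitivity:.4f} specificity={specificity:.4f} "
      f"precision={precision:.4f}")
# accuracy=0.9542 error=0.0458
# sensitivity=0.9934 specificity=0.8627 precision=0.9441
```

Note how a high overall accuracy (95.4%) can hide a much weaker specificity (86.3%): most test tuples are positive, so errors on the smaller negative class barely affect the overall rate.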