Classification
Brendan Tierney
Classification vs. Prediction
§  Classification:
§  predicts categorical class labels
§  classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
§  Prediction:
§  models continuous-valued functions, i.e., predicts unknown or missing values
§  Typical Applications
§  credit approval
§  target marketing
§  medical diagnosis
§  treatment effectiveness analysis
Classification: Definition
§  Given a collection of records (the training set)
§  Each record contains a set of attributes; one of the attributes is the class.
§  Find a model for the class attribute as a function of the values of other
attributes.
§  Goal: previously unseen records should be assigned a class as accurately
as possible.
§  A test set is used to determine the accuracy of the model. Usually, the given data set is divided
into training and test sets, with training set used to build the model and test set used to validate
it.
Classification—A Two-Step Process
§  Model construction: describing a set of predetermined classes
§  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
§  The set of tuples used for model construction: training set
§  The model is represented as classification rules, decision trees, or mathematical formulae
§  Estimate accuracy of the model
§  The known label of test sample is compared with the classified result from the model
§  Accuracy rate is the percentage of test set samples that are correctly classified by the
model
§  Test set is independent of training set, otherwise over-fitting will occur
Classification is about explaining the past
and
predicting the future.
§  Model usage: for classifying future or unknown objects
§  The model is used to classify data tuples whose class labels are not known
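
As a rough illustration of the two steps (not part of the original slides), here is a minimal Python/scikit-learn sketch; the bundled iris data and the 70/30 split are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # attributes and known class labels

# Step 1: model construction on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Estimate accuracy: compare the known labels of the test set with the model's output
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy rate: {accuracy:.2%}")

# Step 2: model usage - classify records whose class label is not known
print(model.predict(X_test[:3]))
```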
Important Columns
[Diagram: the data assembly operations and the Target column that must be identified in the data]
50 – 50 Split
§  Need to avoid a biased dataset
§  50 – 50 split of target column - if possible
§  Not always possible
[Diagram: the Data Set balanced so that Target 1 and Target 2 each make up 50% of the cases]
Splitting the data set for Training and Testing
[Diagram: the data set divided into a Training Data Set (60%-70%, with Target 1 and Target 2 at 50% each) and a Testing Data Set (30%-40%, target percentages as they fall); see the split sketch below]
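
A sketch of how such a split is commonly produced; scikit-learn's train_test_split with the stratify option keeps the target mix the same in both parts (the data, the 30% test size and the variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 cases with a binary target in a 50/50 mix
X = np.random.rand(100, 4)          # the attribute columns
y = np.array([0, 1] * 50)           # the target column (Target 1 / Target 2)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,                  # 30%-40% held back for testing, the rest for training
    stratify=y,                     # keep the same target mix in both data sets
    random_state=42,
)
print(y_train.mean(), y_test.mean())    # both stay at 0.50
```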
Data Preparation
§  Can involve lots and lots and lots of coding and data preparation
§  One record per case, aka the analytical record
§  Target value is determined based on defined rules (defined with help from
the business)
§  Sampling of data
§  In some cases one Target value occurs in a very small % of cases
§  Training set needs to be balanced
§  Over-sampling of the rare target cases (see the sketch below)
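
The sketch below shows one simple way to balance a training set by randomly over-sampling the rare target cases (sampling with replacement); this is only an illustrative approach, not the method prescribed in the slides:

```python
import pandas as pd

# Illustrative training set in which the target value 1 is rare
df = pd.DataFrame({"attr": range(20), "target": [0] * 18 + [1] * 2})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Sample the rare cases with replacement until the two classes are balanced
oversampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, oversampled])
print(balanced["target"].value_counts())
```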
Illustrating Classification Task

Training Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

[Diagram: the Training Set feeds a Learning algorithm (Induction) that learns a Model; the Model is then applied to the Test Set (Apply Model / Deduction) to assign a Class to each record]
Classification Process (1): Model Construction

Training Data:
  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

[Diagram: the Training Data is fed to the Classification Algorithms to produce a Classifier (Model)]

Example of a Decision Tree rule:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction

Testing Data:
  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) Tenured?

[Diagram: the previously developed Classifier is applied to the Testing Data (this is what we have) and then to Unseen Data to predict what the class should be]
Example of a Decision Tree

Training Data (categorical, categorical and continuous attributes; class = Cheat):
  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Decision Tree Model (splitting attributes):
  Refund?
    Yes -> NO
    No  -> MarSt?
             Married          -> NO
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

What attribute do we start with? This is called Information Gain.
Another Example of Decision Tree

Using the same training data, a different tree also fits:
  MarSt?
    Married          -> NO
    Single, Divorced -> Refund?
                          Yes -> NO
                          No  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

[Diagram: the same Training Set (Tid 1-10) and Test Set (Tid 11-15) as before; a Tree Induction algorithm learns a Decision Tree from the Training Set (Induction), and the Decision Tree is then applied to the Test Set (Apply Model / Deduction) to assign a Class]
How usable or re-usable is the model
§  Does it work for just this one time or can it be used over time with different data?
Apply Model to Test Data

Test Data:
  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the record at each node:
  Refund = No              -> take the No branch to MarSt
  Marital Status = Married -> take the Married branch
  Leaf reached             -> Assign Cheat to “No”
Decision Tree Classification Task

[Diagram: the same Training Set and Test Set as before; the Tree Induction algorithm learns a Decision Tree (Induction), and the tree is applied to the Test Set (Apply Model / Deduction)]

The machine learning algorithms do all the hard work. Data Mining looks to find what algorithm works best for your data (a black/gray box approach). Different tools come with slightly different sets of algorithms. The main ones are ….
Training Dataset / Resultant Decision Tree
[Slides showing an example training dataset and the decision tree produced from it]
Algorithm for Decision Tree Induction
§  Basic algorithm (a greedy algorithm)
§  Tree is constructed in a top-down recursive divide-and-conquer manner
§  At start, all the training examples are at the root
§  Attributes are categorical (if continuous-valued, they are discretized in advance)
§  Examples are partitioned recursively based on selected attributes
§  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
§  Conditions for stopping partitioning
§  All samples for a given node belong to the same class
§  There are no remaining attributes for further partitioning – majority voting is employed for
classifying the leaf
§  There are no samples left
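
Most tools run this greedy induction for you. As an illustration (not from the slides), the sketch below fits a scikit-learn decision tree to the Refund / Marital Status / Taxable Income data from the earlier example, using entropy so that split attributes are chosen by information gain; the one-hot encoding of the categorical attributes is my own assumption:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Training data from the earlier "Example of a Decision Tree" slide
data = pd.DataFrame({
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MarSt":  ["Single", "Married", "Single", "Married", "Divorced",
               "Married", "Divorced", "Single", "Married", "Single"],
    "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # Taxable Income in K
    "Cheat":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(data[["Refund", "MarSt", "TaxInc"]])   # categorical -> indicator columns
y = data["Cheat"]

# criterion="entropy" selects the test attribute at each node by information gain
tree = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # the resultant decision tree
```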
Extracting Classification Rules from Trees
§  Represent the knowledge in the form of IF-THEN rules
§  One rule is created for each path from the root to a leaf
§  Each attribute-value pair along a path forms a conjunction
§  The leaf node holds the class prediction
§  Rules are easier for humans to understand
§  Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40”
THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
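
The example rules above translate directly into code; this tiny sketch just encodes the slide's IF-THEN rules as a Python function (the attribute values are taken verbatim from the slide):

```python
def buys_computer(age, student, credit_rating):
    """IF-THEN rules read off each root-to-leaf path of the decision tree."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31...40":          # the "31…40" age band from the slide
        return "yes"
    # age = ">40"
    return "yes" if credit_rating == "excellent" else "no"

print(buys_computer("<=30", "no", "fair"))   # -> "no"
```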
Decision Tree in SAS
Decision Tree in Oracle
Classification Accuracy Measures
Classification accuracy can be measured in many different ways
Two of the most popular are:
§  Classification / Misclassification rates
§  Other associated values
§  Primarily used for binary classification
§  Average squared error
§  Typically used with Regression / Prediction / Continuous values
§  Used in conjunction with Classification / Misclassification rates
More on measuring Classification Accuracy later
Random Forests
§  Random forests
§  are an ensemble learning method for classification, regression and other tasks, that
operate by constructing a multitude of decision trees at training time and outputting the
class that is the mode of the classes (classification) or mean prediction (regression) of
the individual trees.
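
A minimal scikit-learn sketch of the idea (the bundled iris data and the 100 trees are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Build a multitude of decision trees; each prediction is the mode of the trees' votes
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(forest.predict(X[:3]))
```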
Data Splitting – For Training only
70% / 30%
60% / 40%

Fool’s Gold
[Chart slides]

Model Complexity
[Chart: model behaviour ranging from “not flexible enough” to “too flexible”]

Overfitting
[Charts comparing performance on the Training Set and on the Test Set]
Avoiding Overfitting In Classification
An induced tree may over-fit the training data
§  Too many branches, some may reflect anomalies due to noise or outliers
§  Poor accuracy for unseen samples
Two approaches to avoiding overfitting
§  Prepruning: Halt tree construction early
§  Do not split a node if this would result in a measure of the usefulness of the tree falling below a threshold
§  Difficult to choose an appropriate threshold
§  Application / Software / Algorithm settings
§  Postpruning: Remove branches from a “fully grown” tree to give a sequence of progressively pruned trees
§  Use a set of data different from the training data to decide which is the “best pruned tree”
§  Write code to manage/merge the different nodes/branches
§  Will require some time to investigate the best options for this
(Both approaches can involve a lot of human investigation.)
Approaches to Determine the Final Tree Size
§  Separate training (2/3) and testing (1/3) sets
§  Use cross validation, e.g., 10-fold cross validation
§  Use all the data for training
§  But very difficult to evaluate the models
§  Representative sample of data in Training and Test data sets
§  Not always possible to do
§  Most DM tools will sample the data to give the training and testing data sets
§  Use minimum description length (MDL) principle:
§  halting growth of the tree when the encoding is minimized
Classification in Large Databases
§  Classification—a classical problem extensively studied by statisticians and machine
learning researchers
§  Scalability: Classifying data sets with millions of examples and hundreds of attributes with
reasonable speed
§  Why decision tree induction in data mining?
§  relatively faster learning speed (than other classification methods)
§  convertible to simple and easy to understand classification rules
§  can use SQL queries for accessing databases
§  comparable classification accuracy with other methods
Supervised vs. Unsupervised Learning
§  Supervised learning (classification)
§  Supervision: The training data (observations, measurements, etc.) are accompanied by
labels indicating the class of the observations
§  New data is classified based on the training set
§  Unsupervised learning (clustering)
§  The class labels of the training data are unknown
§  Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data
Case-Based Reasoning
§ Case based reasoning is a classification technique which uses prior
examples (cases) to classify unknown cases
§ Uses lazy evaluation and analysis of similar instances
§ Methodology
§  Instances represented by rich symbolic descriptions – similarity measures
§  Multiple retrieved cases may be combined
§  Tight coupling between case retrieval, knowledge-based reasoning, and problem
solving
Case-Based Reasoning (cont…)
§ The k-nearest neighbour (k-NN) algorithm is the simplest form of case
based reasoning
§ However, instances are not necessarily “points in a Euclidean space”
§ Lots of active research issues
The Nearest Neighbor Algorithm
§ All instances correspond to points in n-D space
§ The nearest neighbours are defined in terms of Euclidean distance (or other
appropriate measure)
§ The target value can be discrete or real-valued
§ For discrete targets, k-NN returns the most common value among the k
training examples nearest to the query
§ For real-valued targets, k-NN returns a combination (e.g. average) of the
nearest neighbours’ target values
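
A small scikit-learn sketch of this, using the surf data from the example that follows; k = 3 is an arbitrary illustrative choice:

```python
from sklearn.neighbors import KNeighborsClassifier

# Training cases from the surf example below: [wave size (ft), wave period (secs)]
X = [[6, 15], [1, 6], [5, 11], [7, 10], [6, 11], [2, 1], [3, 4], [6, 12], [4, 2]]
y = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No"]

# For a discrete target, k-NN returns the most common class among the k nearest cases
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[10, 10]]))      # the query case; its three nearest cases are all "Yes"
```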
Nearest Neighbour Example

  Wave Size (ft)  Wave Period (secs)  Good Surf?
  6               15                  Yes
  1               6                   No
  5               11                  Yes
  7               10                  Yes
  6               11                  Yes
  2               1                   No
  3               4                   No
  6               12                  Yes
  4               2                   No
  Query:
  10              10                  ?
Nearest Neighbour Example (cont…)

[Slide repeating the table above, comparing the query case against every stored case]

It gets more difficult as the number of cases increases.
Nearest Neighbour Example

When a new case is to be classified:
§  Calculate the distance from the new case to all training cases
§  Put the new case in the same class as its nearest neighbour

[Scatter plot: the cases plotted with Wave Size (f1) against Wave Period (f2), with the query cases marked “?”]
k-Nearest Neighbour Example

What about when it’s too close to call? Use the k-nearest neighbour technique:
§  Determine the k nearest neighbours to the query case
§  Put the new case into the same class as the majority of its nearest neighbours (neighbours vs. neighbour)

[Scatter plot: the query case “?” and its nearest neighbours (labelled 1 and 2) on the Wave Size (f1) / Wave Period (f2) axes]
Nearest Neighbour Distance Measures
§ Any kind of measurement can be used to calculate the distance between cases
§ The measurement most suitable will depend on the type of features in the problem
§ Euclidean distance is the most used technique
d = sqrt( Σ_{i=1}^{n} (t_i − q_i)² )

§  where n is the number of features, t_i is the i-th feature of the training case and q_i is the i-th feature of the query case
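
The formula translates directly into code; a minimal sketch (NumPy is an illustrative choice):

```python
import numpy as np

def euclidean_distance(t, q):
    """d = sqrt( sum over the n features of (t_i - q_i)^2 )."""
    t, q = np.asarray(t, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((t - q) ** 2))

print(euclidean_distance([6, 15], [10, 10]))   # distance from one training case to a query case
```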
Classification by Decision Tree Induction
§  Decision tree
§  Decision trees are the most widely used classification technique in data mining today
§  Formulate problems into a tree composed of decision nodes (or branch nodes) and classification
nodes (or leaf nodes)
§  Internal node denotes a test on an attribute
§  Branch represents an outcome of the test
§  Leaf nodes represent class labels or class distribution
§  Problem is solved by navigating down the tree until we reach an appropriate leaf node
§  The tricky bit is building the most efficient and powerful tree
§  Decision tree generation consists of two phases
§  Tree construction
§  At start, all the training examples are at the root
§  Partition examples recursively based on selected attributes
§  Tree pruning
§  Identify and remove branches that reflect noise or outliers
§  Use of decision tree: Classifying an unknown sample
§  Test the attribute values of the sample against the decision tree
Classification is about explaining the past
and
predicting the future.
Bayesian Classification: Why?
§  Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
§  Incremental: Each training example can incrementally increase/decrease the probability
that a hypothesis is correct. Prior knowledge can be combined with observed data.
§  Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
§  Standard: Even when Bayesian methods are computationally intractable, they can provide
a standard of optimal decision making against which other methods can be measured
Bayesian Theorem
§  Given training data D, posteriori probability of a hypothesis h, P(h|D) follows the Bayes
theorem
P(h | D) = P(D | h) P(h) / P(D)

§  MAP (maximum a posteriori) hypothesis

h_MAP ≡ arg max_{h ∈ H} P(h | D) = arg max_{h ∈ H} P(D | h) P(h)
§  Practical difficulty: require initial knowledge of many probabilities, significant computational
cost
Bayesian classification
§  The classification problem may be formalized using a-posteriori probabilities:
§  P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
§  E.g. P(class=N | outlook=sunny,windy=true,…)
§  Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities
§  Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
§  P(X) is constant for all classes
§  P(C) = relative freq of class C samples
§  C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
§  Problem: computing P(X|C) is infeasible!
Naïve Bayesian Classification
§  Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
§  If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C
§  If i-th attribute is continuous:
P(xi|C) is estimated thru a Gaussian density function
§  Computationally easy in both cases
Play-tennis example: estimating P(xi|C)
[Slides: the play-tennis training data and the frequency counts used to estimate P(xi|C) for each attribute value and class]
Play-tennis example: classifying X
§  An unseen sample X = <rain, hot, high, false>
§  P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
§  P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
§  Sample X is classified in class n (don’t play)
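
The arithmetic is easy to verify; this small sketch just multiplies out the probabilities quoted on the slide:

```python
# Naive Bayes comparison for X = <rain, hot, high, false>, using the slide's estimates
score_play      = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # P(X|p) * P(p)
score_dont_play = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # P(X|n) * P(n)

print(round(score_play, 6))        # 0.010582
print(round(score_dont_play, 6))   # 0.018286
# The larger value wins, so X is assigned to class n (don't play)
```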
Naïve Bayesian Classifier: Comments
Advantages :
§  Easy to implement
§  Good results obtained in most of the cases
Disadvantages
§  Assumption: class conditional independence, therefore loss of accuracy
§  Practically, dependencies exist among variables
§  E.g., hospitals: patients: Profile: age, family history, etc.
§  Symptoms: fever, cough, etc. Disease: lung cancer, diabetes, etc.
§  Dependencies among these cannot be modeled by a Naïve Bayesian Classifier
How to deal with these dependencies?
§  Bayesian Belief Networks
Neural Networks
•  Very good results
•  Suitable for a large number of problem areas
•  But
•  Very slow at building models. Too much complexity
•  Cannot be used in many “Real World” scenarios
•  Cannot explain how or why it is making its decisions
•  Cannot integrate into applications
•  Cannot do real time
Support Vector Machines (SVM)
§  In classification problems we try to create decision boundaries between
classes
§  A choice must be made between possible boundaries
SVMs (cont…)
§  The decision boundary should be as far away from the data of both classes
as possible
Margins
[Figure: the margin around the decision boundary and the support vectors that define it]
SVMs: The Clever Bit!
§  What about when classes are not linearly separable?
§  Kernel functions and the kernel trick are used to transform data into a
different linearly separable feature space
SVMs: The Clever Bit! (cont...)
§  What if the data is not linearly separable?
§  Project the data to high dimensional space where it is linearly separable and
then we can use linear SVM – (Using Kernels)
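
A minimal scikit-learn sketch of this idea; the RBF kernel and the two-moons data (which a straight line cannot separate) are illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: the classes cannot be separated by a straight line
X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

# The RBF kernel implicitly maps the data into a space where a linear boundary works
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```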
Summary Of SVM Classification
§  Strengths
§  Over-fitting is not common
§  Works well with high dimensional data
§  Fast classification
§  Good generalization capacity
§  Weaknesses
§  Retraining is difficult
§  No explanation capability
§  Slow training
§  At the cutting edge of machine learning
Single Class – Support Vector Machine
§  What if we have data with just one class value
§  Will try to map data into MORE Like and LESS Like
§  Anomaly Detection
§  If the prediction is 1, the case is considered typical.
§  If the prediction is 0, the case is considered anomalous.
§  You can specify the percentage of the data that you expect to be anomalous.
§  If you have some knowledge that the number of "suspicious" cases is a certain
percentage of your population, then you can set the outlier rate to that percentage.
§  The model will identify approximately that many "rare“ cases when applied to the
general population. The default is 10%, which is probably high for many anomaly
detection problems.
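
As an illustration, scikit-learn's OneClassSVM expresses the same idea, although its conventions differ slightly from the 1/0 and outlier-rate description above (which reads like a specific tool's behaviour): the expected outlier fraction is set via the nu parameter and predictions come back as +1 (typical) or -1 (anomalous):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))            # data containing only one "class"

# nu roughly corresponds to the expected fraction of anomalous cases (10% here)
model = OneClassSVM(nu=0.10, kernel="rbf").fit(X)

labels = model.predict(X)                # +1 = typical, -1 = anomalous
print((labels == -1).mean())             # roughly the chosen outlier rate
```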
Single Class – Support Vector Machine – Fraud Detection
…
…
§  The cases with the highest probability of anomaly contain highly questionable data values when comparing the make and age of the wrecked vehicle with the reported value of that vehicle.
§  This data may indicate the need for further investigation of these particular claims for possible fraud
What Is Prediction?
§  Prediction is similar to classification
§  First, construct a model
§  Second, use the model to predict an unknown value
§  Major method for prediction is regression
§  Linear and multiple regression
§  Non-linear regression
§  Prediction is different from classification
§  Classification predicts a categorical class label
§  Prediction models continuous-valued functions
Linear Regression Prediction Formula
ŷ = ŵ0 + ŵ1·x1 + ŵ2·x2

§  ŷ is the prediction estimate, ŵ0 the intercept estimate, ŵ1 and ŵ2 the parameter estimates, and x1, x2 the input measurements

Choose the intercept and parameter estimates to minimize the squared error function over the training data:

Σ_i (y_i − ŷ_i)²
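
A least-squares fit of exactly this form; a minimal sketch with made-up data (the two input measurements and the true coefficients are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((50, 2))                                   # the two input measurements x1, x2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Fitting chooses the intercept and parameter estimates that minimise the squared error
reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_)                          # w0_hat and (w1_hat, w2_hat)
```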
Regression Example
§  Engineers concerned about the stability of the Leaning
Tower of Pisa have done extensive studies of its
increasing tilt. The following table gives measurements for
the years 1975 to 1987.
§  The variable “lean” represents the difference between
where a point on the tower would be if the tower were
straight and where it actually is. The lean is measured in
meters.
§  Where will the tower be in 1988 or 1989?
Regression Example
[Chart slides: the lean plotted against year, with the fitted regression line extended to predict the tilt for 1988, 1989, 1990 and 1995, which are marked “?”]
Pros/Cons Of Regression Analysis
§  Strengths
§  Mathematically very well established
§  Can be readily understood
§  Can be extended to handle non-linear problems (polynomial regression)
§  Weaknesses
§  Does not cope with missing data
§  Some problems with categorical data
Logistic Regression
§  Binary Logistic Regression is a type of regression analysis where the
dependent variable is coded as 0 or 1.
§  But
§  With our data this is not always possible, e.g.
§  Life time value
§  Income
§  Profit
§  Usage
§  ….
§  These have a Numeric target/value
Can you think of problems this can be used on?
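
A short scikit-learn sketch of binary logistic regression on a target already coded as 0 or 1 (the bundled breast-cancer data set is an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)        # the target is already coded as 0 or 1
clf = LogisticRegression(max_iter=5000).fit(X, y)
print(clf.predict_proba(X[:3]))                    # predicted probability of each class
```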
Regression Measures

R-squared is the percentage of the response variable variation that is explained by a linear model.
R-squared = Explained variation / Total variation
Residual = Observed value - Fitted value

R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
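
A small sketch of computing residuals and R-squared from observed and fitted values (the numbers are made up; scikit-learn's r2_score reports the value as a fraction between 0 and 1 rather than a percentage):

```python
from sklearn.metrics import r2_score

observed = [3.0, 4.5, 6.1, 8.2]
fitted   = [2.8, 4.7, 6.0, 8.5]

residuals = [o - f for o, f in zip(observed, fitted)]   # Observed value - Fitted value
print(residuals)
print(r2_score(observed, fitted))                       # explained variation / total variation
```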
Regression Measures
[Chart slide]
Regression Prediction
§  Can be difficult to predict the correct value
§  Does the predicted value make sense
§  Is there a difference between 1001 and 1005
§  Sometimes the continuous value is Binned into ranges
§  Can then use binary classification for this
§  Only if there is a small-ish number of Bins
Concerns Over Classification Techniques
§  When choosing a technique for a specific classification problem we must
consider the following issues:
§  Classification accuracy
§  Training speed
§  Classification speed
§  Danger of over-fitting
§  Generalisation capacity
§  Implications for retraining
§  Explanation capability
Evaluating Classification Accuracy
§  During development, and in testing before deploying a classifier in the wild,
we need to be able to quantify the performance of the classifier
§  How accurate is the classifier?
§  When the classifier is wrong, how is it wrong?
§  Useful to decide on which classifier (which parameters) to use and to
estimate what the performance of the system will be
Evaluating Classifiers (cont…)
§  How we do this depends on how much data is available
§  If there is unlimited data available then there is no problem
§  Usually we have less data than we would like so we have to compromise
§  Use hold-out testing sets
§  Cross validation
§  K-fold cross validation
§  Leave-one-out validation
§  Parallel live test
Hold-Out Testing Sets
§  Split the available data into a training set and a test set
§  Train the classifier on the training set and evaluate based on the test set
§  A couple of drawbacks
§  We may not have enough data
§  We may happen upon an unfortunate split
K-Fold Cross Validation
§  Divide the entire data set into k folds
§  For each of k experiments, use kth fold for testing and everything else for
training
K-Fold Cross Validation (cont…)
§  The accuracy of the system is calculated as the average error across the k
folds
§  The main advantages of k-fold cross validation are that every example is
used in testing at some stage and the problem of an unfortunate split is
avoided
§  Any value can be used for k
§  10 is most common
§  Depends on the data set
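
A minimal sketch of 10-fold cross validation with scikit-learn (the classifier and the bundled data set are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Every example is used for testing exactly once across the 10 folds
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
print(scores.mean())          # overall accuracy = average across the folds
```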
Leave-One-Out Cross Validation
§  Extreme case of k-fold cross validation
§  With N data examples perform N experiments with N-1 training cases and 1
test case
Classifier Accuracy
§  The accuracy of a classifier on a given test set is the percentage of test set
tuples that are correctly classified by the classifier
§  Often also referred to as recognition rate
§  Error rate (or misclassification rate) is the opposite of accuracy
False Positives Vs False Negatives
§  While it is useful to generate the simple accuracy of a classifier, sometimes
we need more
§  When is the classifier wrong?
§  False positives vs false negatives
§  Related to type I and type II errors in statistics
§  Often there is a different cost associated with false positives and false
negatives
§  Think about diagnosing diseases
Confusion Matrix
§  Device used to illustrate how a classifier is performing in terms of false positives and false
negatives
§  Gives us more information than a
single accuracy figure
§  Allows us to think about the cost
of mistakes
§  Can be extended to any
number of classes
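
A small sketch of producing one for a binary problem with scikit-learn; note its convention of actual classes in rows and predicted classes in columns (the labels here are made up):

```python
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are the actual classes, columns the predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(actual, predicted))
```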
Type I & Type II Errors
[Figure: Type I (false positive) vs. Type II (false negative) errors]

Which is Better?
[Figure]
ROC Curves
§  Receiver Operating Characteristic (ROC) curves were originally used to
make sense of noisy radio signals
§  Can be used to help us talk about classifier performance and determine the
best operating point for a classifier
§  Consider how the relationship between true
positives and false positives can change
§  We need to choose the best operating point
ROC Curves (cont…)
§  ROC curves can be used to compare classifiers
§  The greater the area under the curve the more accurate the classifier
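
A minimal sketch of computing a ROC curve and the area under it from a classifier's scores (the actual labels and the scores are made up):

```python
from sklearn.metrics import roc_curve, auc

actual = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.5]   # classifier confidence in class 1

fpr, tpr, thresholds = roc_curve(actual, scores)   # false/true positive rates at each threshold
print(auc(fpr, tpr))                               # area under the curve: larger is better
```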
Over-Fitting
§  When we train a classifier we are trying to learn a function approximated
by the training data we happen to use
§  What if the training data doesn’t cover the whole problem space?
§  We can learn the training data too closely which hampers the ability to
generalise
§  This problem is known as over-fitting
§  Depending on the type of classifier used there are different approaches to
avoiding this
The real test or evaluation of a model
§  Deploy the model
§  Monitor the data based on classifications
§  Does the data generated from the Model reflect the real world?
§  How long to monitor for?
§  Gives a true evaluation
Model Update
§  Do we need to update the models ?
§  New (initial) Model
§  Reflects the current position
[Diagram: Historical Data is used to build the Model; New Data keeps arriving and can be used to update it]
§  We keep gathering new data/cases
§  Update a Model
§  How often do we need to update the models ?
The real test or evaluation of a model
Real World Scenario
§  Decision Tree model selected: based on outcomes of Testing & Evaluating the models
§  Deploy in Production :-)
§  Monitor Production: We are getting lots and lots of mis-classifications :-(
§  Why?
§  One of the other models (Naïve Bayes) gave the best Production results :-)
§  You need to constantly monitor and look out for this behaviour.
Assessing the Deployed Model
§  Design experiments
§  Compare actual results against expectations.
§  Compare the challenger’s results against the champion’s.
§  Did the model find the right people?
§  Did the action affect their behaviour?
§  What are the characteristics of the customers most affected by the intervention?
§  The proof of the pudding is in the eating
Summary
§  Classification is an extensively studied problem
§  Classification is probably one of the most widely used data mining
techniques with a lot of extensions
§  Scalability is still an important issue for database applications: thus
combining classification with database techniques should be a promising topic
§  Research directions: classification of non-relational data, e.g., text, spatial,
multimedia, etc. classification of skewed data sets