Issues with Data Mining

Data Mining Involves Generalization
• Data mining (machine learning) learns generalizations of the instances in the training data
  – E.g. a decision tree learnt from the weather data captures generalizations about the prediction of values for the Play attribute
  – This means generalizations predict (or describe) the behaviour of instances beyond the training data
  – This in turn means knowledge is extracted from raw data using data mining
• This knowledge drives the end-user's decision-making process

Generalization as Search
• The process of generalization can be viewed as searching a space of all possible patterns or models
  – For a pattern that fits the data
• This view provides a standard framework for understanding all data mining techniques
• E.g. decision tree learning involves searching through all possible decision trees
  – Lecture 4 shows two example decision trees that fit the weather data
  – One of them (Example 2) is a better generalization than the other

Bias
• Important choices made in a data mining system are
  – Representation language: the language chosen to represent the patterns or models
  – Search method: the order in which the space is searched
  – Model pruning method: the way overfitting to the training data is avoided
• This means each data mining scheme involves
  – Language bias
  – Search bias
  – Overfitting-avoidance bias

Language Bias
• Different languages are used for representing patterns and models
  – E.g.
rules and decision trees
• A concept fits a subset of the training data
  – That subset can be described as a disjunction of rules
  – E.g. a classifier for the weather data can be represented as a disjunction of rules
• Languages differ in their ability to represent patterns and models
  – This means that when a language with lower representational ability is used, the data mining system may not achieve good performance
• Domain knowledge (external to the training data) helps to cut down the search space

Search Bias
• An exhaustive search over the search space is computationally expensive
• Search is sped up by using heuristics
  – Pure child nodes indicate good tree stumps in decision tree learning
• By definition, heuristics cannot guarantee optimal patterns or models
  – Using information gain may mislead us into selecting a suboptimal attribute at the root
• More complex search strategies are possible
  – Those that pursue several alternatives in parallel
  – Those that allow backtracking
• A high-level search bias
  – General-to-specific: start with a root node and grow the decision tree to fit the specific data
  – Specific-to-general: choose specific examples in each class and then generalize the class by including k-nearest-neighbour examples

Overfitting-avoidance Bias
• We want to search for the 'best' patterns and models
• Simple models are the best
• Two strategies
  – Start with the simplest model and stop building the model when it starts to become complex
  – Start with a complex model and prune it to make it simpler
• Each strategy biases the search in a different way
• Biases are unavoidable in practice
  – Each data mining scheme might involve a configuration of biases
  – These biases may serve some problems well
• There is no universal best learning scheme!
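The information-gain heuristic mentioned under Search Bias can be made concrete with a short sketch. This is a minimal illustration, not Weka's implementation; the six-instance Outlook/Play table below is a hypothetical cut-down version of the weather data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attribute_values, labels):
    """Expected reduction in entropy from splitting on one attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [y for a, y in zip(attribute_values, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical mini weather data: the Outlook attribute and the Play class.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]

print(round(info_gain(outlook, play), 3))  # prints 0.667
```

A greedy learner would place the attribute with the highest gain at the root, which is exactly the kind of heuristic choice that can turn out to be suboptimal for the tree as a whole.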
  – We saw this in our practicals with Weka

Combining Multiple Models
• Because there is no ideal data mining scheme, it is useful to combine multiple models
  – The idea of democracy: decisions are made based on collective wisdom
  – Each data mining scheme acts like an expert, using its knowledge to make decisions
• Three general approaches
  – Bagging
  – Boosting
  – Stacking
• Bagging and boosting both follow the same approach
  – Take a vote on the class prediction from all the different schemes
  – Bagging uses a simple average of votes, while boosting uses a weighted average
  – Boosting gives more weight to more knowledgeable experts
• Boosting is generally considered the most effective

Bias-Variance Decomposition
• Assume
  – Infinitely many training data sets of the same size, n
  – Infinitely many classifiers trained on the above data sets
• For any learning scheme
  – Bias = expected error of the classifier even after increasing the training data infinitely
  – Variance = expected error due to the particular training set used
• Total expected error = bias + variance
• Combining multiple classifiers decreases the expected error by reducing the variance component

Bagging
• Bagging stands for bootstrap aggregating
  – Combines equally weighted predictions from multiple models
• Bagging exploits instability in learning schemes
  – Instability: a small change in the training data results in a big change in the model
• Idealized version for a classifier
  – Collect several independent training sets
  – Build a classifier from each training set
    • E.g. learn a decision tree from each training set
  – The class of a test instance is the prediction that received the most votes from all the classifiers
• In practice it is not feasible to obtain several independent training sets

Bagging Algorithm
• Involves two stages
• Model generation
  – Let n be the number of instances in the training data
  – For each of t iterations
    • Sample n instances with replacement from the training data
    • Apply the learning algorithm to the sample, then
store the resulting model
• Classification
  – For each of the t models:
    • Predict the class of the instance using the model
  – Return the class that has been predicted most often

Boosting
• Multiple data mining methods might complement each other
  – Each method performing well on a subset of the data
• Boosting combines complementary models
  – Using weighted voting
• Boosting is iterative
  – Each new model is built to overcome the deficiencies of the earlier models
• There are several variants of boosting
  – AdaBoost.M1: based on the idea of giving weights to instances
• Boosting involves two stages
  – Model generation
  – Classification

Boosting
• Model generation
  – Assign equal weight to each training instance
  – For each of t iterations:
    • Apply the learning algorithm to the weighted dataset and store the resulting model
    • Compute the error e of the model on the weighted dataset and store the error
    • If e = 0 or e >= 0.5
      – Terminate model generation
    • For each instance in the dataset:
      – If the instance is classified correctly by the model:
        – Multiply the weight of the instance by e/(1-e)
    • Normalize the weights of all instances
• Classification
  – Assign a weight of zero to all classes
  – For each of the t (or fewer) models:
    • Add -log(e/(1-e)) to the weight of the class predicted by the model
  – Return the class with the highest weight

Stacking
• Bagging and boosting combine models of the same type
  – E.g.
a set of decision trees
• Stacking is applied to models of different types
  – Because voting may not work when the different models do not perform comparably well
  – Voting is problematic when two out of three classifiers perform poorly
• Stacking uses a metalearner to combine different base learners
  – Base learners: level-0 models
  – Metalearner: level-1 model
  – Predictions of the base learners are fed as inputs to the metalearner
• Base learner predictions on the training data cannot be input to the metalearner
  – Instead, use cross-validation results on the base learners
• Because classification is done by the base learners, metalearners use simple learning schemes

Combining Models Using Weka
• Weka offers methods to perform bagging, boosting and stacking over classifiers
• In the Explorer, under the Classify tab, expand the 'meta' section of the hierarchical menu
• AdaBoostM1 (one of the boosting methods) on the Iris data classifies only 7 out of 150 instances incorrectly
• You are encouraged to try these methods on your own
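The two-stage bagging algorithm described above can be sketched in plain Python. This is an illustrative sketch, not Weka's implementation: the decision-stump base learner, the one-attribute toy data set, and the helper names (`train_stump`, `bag`) are all assumptions made for the example.

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a one-split decision stump on (value, label) pairs by trying
    every observed value as a threshold and keeping the split with the
    fewest training errors."""
    labels = sorted({y for _, y in data})
    l0, l1 = labels[0], labels[-1]  # identical if a bootstrap sample has one class
    best = None  # (errors, threshold, label_if_below, label_if_above)
    for thr in sorted({x for x, _ in data}):
        for below, above in ((l0, l1), (l1, l0)):
            errs = sum((below if x <= thr else above) != y for x, y in data)
            if best is None or errs < best[0]:
                best = (errs, thr, below, above)
    _, thr, below, above = best
    return lambda x: below if x <= thr else above

def bag(data, t=25, seed=0):
    """Bagging: for each of t iterations, sample n instances with
    replacement, learn a model on the sample, and classify new instances
    by the class that receives the most votes from the stored models."""
    rng = random.Random(seed)
    n = len(data)
    models = [train_stump([rng.choice(data) for _ in range(n)]) for _ in range(t)]
    def classify(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return classify

# Hypothetical toy data: class "a" clusters at low values, "b" at high values.
data = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "b"), (6, "b")]
classifier = bag(data)
print(classifier(0.5), classifier(6.5))  # majority vote over 25 bootstrapped stumps
```

Stumps are deliberately weak, unstable learners, which is exactly the situation bagging exploits: each bootstrap sample can shift the chosen threshold, and the vote averages those shifts away.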