Download Chapter 5 Data Mining a Closer Look

Chapter 5 Data mining : A Closer Look Chapter Objectives  Determine an appropriate data mining strategy for a specific problem.  Know about several data mining techniques and how each technique builds a generalized model to represent data.  Understand how a confusion matrix is used to help evaluate supervised learner models. Data Warehouse and Data Mining 2 Chapter 5 Chapter Objectives Understand basic techniques for evaluating supervised learner models with numeric output. Know how measuring lift can be used to compare the performance of several competing supervised learner models. Understand basic techniques for evaluating unsupervised learner models. Data Warehouse and Data Mining 3 Chapter 5 Data Mining Strategies Classification is probably the best understood of all data mining strategies. Classification tasks have three common characteristics. • Learning is supervised. • The dependent variable is categorical. • The emphasis is on building models able to assign new instances to one of a set of welldefined classes. Data Warehouse and Data Mining 4 Chapter 5 Data Mining Strategies • Some example classification tasks include the following: •Determine those characteristics that differentiate individuals who have suffered a heart attack from those who have not. • Develop a profile of a “successful” person. • Determine if a credit card purchase is fraudulent. • Classify a car loan applicant as a good or a poor credit risk. • Develop a profile to differentiate female and male stroke victims. Data Warehouse and Data Mining 5 Chapter 5 Data Mining Strategies Data Warehouse and Data Mining 6 Chapter 5 Data Mining Strategies Data Warehouse and Data Mining 7 Chapter 5 Data Mining Strategies Data Warehouse and Data Mining 8 Chapter 5 Data Mining Strategies Data Warehouse and Data Mining 9 Chapter 5 Data Mining Strategies 34% are healthy within these max heart rate range Data Warehouse and Data Mining 10 Chapter 5 Supervised Data Mining Techniques Data Warehouse and Data Mining 11 Chapter 5 Supervised Data Mining Techniques Data Warehouse and Data Mining 12 Chapter 5 Supervised Data Mining Techniques Data Warehouse and Data Mining 13 Chapter 5 Supervised Data Mining Techniques Data Warehouse and Data Mining 14 Chapter 5 Supervised Data Mining Techniques Data Warehouse and Data Mining 15 Chapter 5 Association Rules Data Warehouse and Data Mining 16 Chapter 5 Clustering Techniques Data Warehouse and Data Mining 17 Chapter 5 Clustering Techniques Data Warehouse and Data Mining 18 Chapter 5 Evaluating Performance Data Warehouse and Data Mining 19 Chapter 5 Evaluating Performance Data Warehouse and Data Mining 20 Chapter 5 Evaluating Performance Data Warehouse and Data Mining 21 Chapter 5 Evaluating Performance Data Warehouse and Data Mining 22 Chapter 5 Evaluating Performance Data Warehouse and Data Mining 23 Chapter 5 Chapter Summary Data mining strategies include classification, estimation, prediction, unsupervised clustering, and market basket analysis. Classification and estimation strategies are similar in that each strategy is employed to build models able to generalize current outcome. However, the output of a classification strategy is categorical, whereas the output of an estimation strategy is numeric. Data Warehouse and Data Mining 24 Chapter 5 Chapter Summary A predictive strategy differs from a classification or estimation strategy in that it is used to design models for predicting future outcome rather than current behavior. Unsupervised clustering strategies are employed to discover hidden concept structures in data as well as to locate atypical data instances. The purpose of market basket analysis is to find interesting relationships among retail products. Discovered relationships can be used to design promotions, arrange shelf or catalog items, or develop crossmarketing strategies. Data Warehouse and Data Mining 25 Chapter 5 Chapter Summary A data mining technique applies a data mining strategy to a set of data. Data mining techniques are defined by an algorithm and a knowledge structure. Common features that distinguish the various techniques are whether learning is supervised or unsupervised and whether their output is categorical or numeric. Data Warehouse and Data Mining 26 Chapter 5 Chapter Summary Familiar supervised data mining techniques include decision tree methods, production rule generators, neural networks, and statistical methods. Association rules are a favorite technique for marketing applications. Clustering techniques employ some measure of similarity to group instances into disjoint partitions. Clustering methods are frequently used to help determine a best set of input attributes for building supervised learner models. Data Warehouse and Data Mining 27 Chapter 5 Chapter Summary Performance evaluation is probably the most critical of all the steps in the data mining process. Supervised model evaluation is often performed using a training/test set scenario. Supervised models with numeric output can be evaluated by computing average absolute or average squared error differences between computed and desired outcome. Data Warehouse and Data Mining 28 Chapter 5 Chapter Summary Marketing applications that focus on mass mailings are interested in developing models for increasing response rates to promotions. A marketing application measures the goodness of a model by its ability to lift response rate thresholds to levels well above those achieved by naïve (mass) mailing strategies. Unsupervised models support some measure of cluster quality that can be used for evaluative purposes. Supervised learning can also be employed to evaluate the quality of the clusters formed by an unsupervised model. Data Warehouse and Data Mining 29 Chapter 5 Key Terms Association rule. A production rule whose consequent may contain multiple conditions and attribute relationships. An output attribute in one association rule can be an input attribute in other rule. Classification. A supervised learning strategy where the output attribute is categorical. The emphasis is on building models able to assign new instances to one of a set of well-defined classes. Confusion matrix. A matrix used to summarize the results of a supervised classification. Entries along the main diagonal represent the total number of correct classifications. Entries other than those on the main diagonal represent classification errors. Data Warehouse and Data Mining 30 Chapter 5 Key Terms Data mining strategy. An outline of an approach for problem solution. Data mining technique. One or more algorithms together with an associated knowledge structure. Dependent variable. A variable whose value is determined by a combination of one or more independent variables. Estimation. A supervised learning strategy where the output attribute is numeric. Emphasis is on determining current rather than future outcome. Data Warehouse and Data Mining 31 Chapter 5 Key Terms Independent variable. An input attribute used for building supervised or unsupervised learner models. Lift. The probability of class Ci given a sample taken from population P divided by the probability of Ci given the entire population P. Lift chart. A graph that displays the performance of a data mining model as a function of sample size. Linear regression. A supervised learning technique that generalizes numeric data as a linear equation. The equation defines the value of an output attribute as a linear sum of weighted input attribute values. Data Warehouse and Data Mining 32 Chapter 5 Key Terms Market basket analysis. A data mining strategy that attempts to find interesting relationships among retail products. Mean absolute error. For a set of training or test set instances, the mean absolute error is the average absolute difference between classifier predicted output and actual output. Mean squared error. For a set of training or test set instances, the mean squared error is the average of the sum of squared differences between classifier predicted output and actual output. Neural network. A set of interconnected nodes designed to imitate the functioning of the human brain. Data Warehouse and Data Mining 33 Chapter 5 Key Terms Outliers. Atypical data instances. Prediction. A supervised learning strategy designed to determine future outcome. Root mean squared error. The square root of the mean squared error. Rule Maker. A supervised learner model for generating production rules from data. Statistical regression. A supervised learning technique that generalizes numerical data as a mathematical equation. The equation defines the value of an output attribute as a sum of weighted input attribute values. Data Warehouse and Data Mining 34 Chapter 5 Data Warehouse and Data Mining 35 Chapter 5

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Chapter 5 Data Mining a Closer Look