Discovering Knowledge in Data: An Introduction to Data Mining
By Daniel T. Larose, Ph.D.
Chapter Summaries and Keywords
Preface.
The preface begins by discussing why Discovering Knowledge in Data: An Introduction
to Data Mining (DKD) is needed. Because of the powerful data mining software
platforms currently available, a strong caveat is given against glib application of data
mining methods and techniques. In other words, data mining is easy to do badly. The
best way to avoid these costly errors, which stem from a blind black-box approach to data
mining, is to instead apply a “white-box” methodology, which emphasizes an
understanding of the algorithmic and statistical model structures underlying the software.
DKD applies this white-box approach by (1) walking the reader through the operations
and nuances of the various algorithms, using small sample data sets, so that the reader
gets a true appreciation of what is really going on inside the algorithm, (2) providing
examples of the application of the various algorithms on actual large data sets, (3)
supplying chapter exercises, which allow readers to assess their depth of understanding of
the material, as well as have a little fun playing with numbers and data, and (4) providing
the reader with hands-on analysis problems, representing an opportunity for the reader to
apply his or her newly-acquired data mining expertise to solving real problems using
large data sets. DKD presents data mining as a well-structured standard process, namely,
the Cross-Industry Standard Process for Data Mining (CRISP-DM). DKD emphasizes a
graphical approach to data analysis, stressing in particular exploratory data analysis.
DKD naturally fits the role of textbook for an introductory course in data mining.
Instructors may appreciate (1) the presentation of data mining as a process, (2) the
“white-box” approach, emphasizing an understanding of the underlying algorithmic
structures, (3) the graphical approach, emphasizing exploratory data analysis, and (4) the
logical presentation, flowing naturally from the CRISP-DM standard process and the set
of data mining tasks. DKD is appropriate for advanced undergraduate or graduate-level
courses. No computer programming or database expertise is required.
Keywords:
Algorithm walk-throughs, hands-on analysis problems, chapter exercises, “white-box”
approach, data mining as a process, graphical and exploratory approach.
----------------------------------------
Chapter One: An Introduction to Data Mining
Chapter One begins by defining data mining and investigating why it is needed and how
widespread it is. One of the strong themes of DKD is the need for human
direction and supervision of data mining at all levels. The CRISP-DM standard process is
described, so that data miners within a corporate or research enterprise do not suffer
from isolation. The six phases of the standard process are described in detail, including
(1) the business or research understanding phase, (2) the data understanding phase, (3)
the data preparation phase, (4) the modeling phase, (5) the evaluation phase, and (6) the
deployment phase. A case study from DaimlerChrysler illustrating these phases is
provided. Common fallacies of data mining are debunked. The primary tasks of data
mining are described in detail, including description, estimation, prediction,
classification, clustering, and association. Finally, four case studies are examined in
relation to the phases of the standard process and the set of data mining tasks.
Keywords:
business or research understanding phase, data understanding phase, data preparation
phase, modeling phase, evaluation phase, and deployment phase, description, estimation,
prediction, classification, clustering, and association.
----------------------------------------
Chapter Two: Data Preprocessing
Chapter Two begins by explaining why data preprocessing is needed. Much of the raw
data contained in databases is unpreprocessed, incomplete, and noisy. Therefore, in
order to be useful for data mining purposes, the databases need to undergo
preprocessing, in the form of data cleaning and data transformation. The overriding
objective is to minimize GIGO, to minimize the Garbage that gets Into our model, so that
we can minimize the amount of Garbage that our models give Out. Data preparation
alone accounts for 60% of all the time and effort for the entire data mining process.
Much of the data contains field values that have expired, are no longer relevant, or are
simply missing. Examples of data cleaning are provided. Different methods of handling
missing values are provided, including replacing missing field values with user-defined
constants, means, or random draws from the variable distribution. Identifying
misclassifications is discussed, as well as graphical methods for identifying outliers.
Methods of transforming data are provided, including min-max normalization and z-score
standardization. Finally, numerical methods for identifying outliers are examined,
including the IQR method. The hands-on analysis problems include challenging the
reader to preprocess the large churn data set, so that it will be ready for further analysis
downstream.
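
As a brief illustration of the two transformations named above, the following sketch applies min-max normalization and z-score standardization to a handful of invented field values; the numbers are hypothetical and are not drawn from the churn data set.

# Illustrative sketch: min-max normalization and z-score standardization
# applied to a single numeric field (values below are invented).
from statistics import mean, stdev

values = [75, 110, 133, 244, 98, 187]                 # hypothetical field values

mn, mx = min(values), max(values)
min_max = [(x - mn) / (mx - mn) for x in values]      # rescales to the range [0, 1]

mu, sigma = mean(values), stdev(values)               # sample standard deviation (n - 1)
z_scores = [(x - mu) / sigma for x in values]         # centered at 0, spread about 1

print(min_max)
print(z_scores)
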
Keywords:
Data preprocessing, noisy data, data cleaning, data transformation, minimize GIGO, data
preparation, missing data, misclassifications, identifying outliers, min-max normalization,
z-score standardization.
----------------------------------------
Chapter Three: Exploratory Data Analysis
Chapter Three begins by discussing the difference between hypothesis testing, where an
a priori hypothesis is assessed, and exploratory data analysis. Exploratory data
analysis (EDA), or graphical data analysis, allows the analyst to (1) delve into the data
set, (2) examine the inter-relationships among the attributes, (3) identify interesting
subsets of the observations, and (4) develop an initial idea of possible associations
between the attributes and the target variable, if any. Extensive screen-shots of EDA
carried out on the churn data set are provided. Graphical and numerical methods for
handling correlated variables are examined. Next, exploratory methods for categorical
variables are investigated. Bar charts with overlays, normalized bar charts, and subsetted
bar charts are illustrated, along with crosstabulations and web graphs for investigating
the relationship between two categorical attributes. Throughout the chapter, nuggets of
information are drawn from the EDA that will help us to predict churn, that is, predict
who will leave the company’s service. EDA can help to uncover anomalous fields.
Next, exploratory methods for numerical variables are examined, including summary
statistics, correlations, histograms, normalized histograms, and histograms with overlays.
Then, exploratory methods for multivariate relationships are illustrated, including scatter
plots and 3D scatter plots. Methods for selecting interesting subsets of records are
discussed, along with binning to increase efficiency. The hands-on analysis problems
include challenging the reader to construct a comprehensive exploratory data analysis of
a large real-world data set, the adult data set, and reporting on the salient results.
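
As a small illustration of one of the exploratory tools mentioned above, the following sketch builds a cross-tabulation with row-normalized proportions, the tabular counterpart of a normalized bar chart. The column names and records are assumptions invented for the example, not fields from the churn data set.

# Illustrative sketch: cross-tabulation of two categorical attributes.
import pandas as pd

# Invented records standing in for two categorical fields from a churn-style data set
df = pd.DataFrame({
    "IntlPlan": ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "Churn":    ["yes", "no", "no", "yes", "no", "yes", "no", "no"],
})

counts = pd.crosstab(df["IntlPlan"], df["Churn"])                          # raw counts
proportions = pd.crosstab(df["IntlPlan"], df["Churn"], normalize="index")  # row proportions

print(counts)
print(proportions.round(3))
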
Keywords:
Graphical data analysis, inter-relationships among variables, handling correlated
variables, categorical variables, normalized bar charts, crosstabulations, web graphs,
predicting churn, anomalous fields, summary statistics, histograms, scatter plots, binning.
----------------------------------------
Chapter Four: Statistical Approaches to Estimation and Prediction
If estimation and prediction are considered to be data mining tasks, then statistical
analysts have been performing data mining for over a century. Chapter Four examines
some of the more widespread and traditional methods of estimation and prediction,
drawn from the world of statistical analysis. We begin by examining univariate
methods, statistical estimation and prediction methods that analyze one variable at a
time. Measures of center and location are defined, such as the mean, median, and mode,
followed by measures of spread or variability, such as the range and standard deviation.
The process of statistical inference is described, using sample statistics to estimate the
unknown value of population parameters. Sampling error is defined, leading to methods
to measure the confidence of our estimates, such as the margin of error. The univariate
methods of point estimation and confidence interval estimation are discussed. Next we
consider simple linear regression, where the relationship between two numerical
variables is investigated. Standard errors are discussed, along with the meaning of the
regression coefficients. The dangers of extrapolation are illustrated, as are confidence
intervals and prediction intervals in the context of regression. Finally, we examine
multiple regression, where the relationship between a response variable and a set of
predictor variables is modeled linearly. Methods are shown for verifying regression
model assumptions. The hands-on analysis problems include generating simple linear
regression and multiple regression models for predicting nutrition rating for breakfast
cereals using the cereals data set.
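
The following sketch illustrates simple linear regression by least squares, in the spirit of the cereals exercise. The sugar and rating values are invented stand-ins, not records from the cereals data set.

# Illustrative sketch: simple linear regression by least squares.
import numpy as np

x = np.array([6.0, 8.0, 5.0, 0.0, 12.0, 3.0])         # e.g., grams of sugar (invented)
y = np.array([40.4, 34.0, 45.8, 68.4, 27.8, 59.4])    # e.g., nutrition rating (invented)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept

y_hat = b0 + b1 * x                                    # fitted values
residuals = y - y_hat
s = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))     # standard error of the estimate

print(f"rating_hat = {b0:.2f} + {b1:.2f} * sugars, s = {s:.2f}")
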
Keywords:
Estimation, prediction, univariate methods, measures of center, mean, median, mode,
measures of spread, range, standard deviation, statistical inference, sample statistics,
population parameters, sampling error, margin of error, point estimation, confidence
interval estimation, simple linear regression, prediction intervals, multiple regression,
response variable, predictor variables, model assumptions.
----------------------------------------
Chapter Five: The K-Nearest Neighbor Algorithm
Chapter Five begins with a discussion of the differences between supervised and
unsupervised methods. In unsupervised methods, no target variable is identified as such.
Most data mining methods are supervised methods, however, meaning that (a) there is a
particular pre-specified target variable, and (b) the algorithm is given many examples
where the value of the target variable is provided, so that the algorithm may learn which
values of the target variable are associated with which values of the predictor variables.
A general methodology for supervised modeling is provided, for building and evaluating
a data mining model. The training data set, test data set, and validation data set are
discussed. The tension between model overfitting and underfitting is illustrated
graphically, as is the bias-variance tradeoff. High-complexity models are associated with
high accuracy on the training data but high variability (variance) on new data. The
mean-square error is introduced as a combination of bias and variance. The general
classification task is recapitulated. The k-nearest neighbor algorithm is introduced, in the
context of a patient-drug classification problem. Voting with different values of k is shown
to sometimes lead to different
results. The distance function, or distance metric, is defined, with Euclidean distance
being typically chosen for this algorithm. The combination function is defined, for both
simple unweighted voting and weighted voting. Stretching the axes is shown as a method
for quantifying the relevance of various attributes. Database considerations, such as
balancing, are discussed. Finally, k-nearest neighbor methods for estimation and
prediction are examined, along with methods for choosing the best value for k.
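
The following sketch illustrates the core of the k-nearest neighbor classifier described above: Euclidean distance as the distance function and simple unweighted voting as the combination function. The patient records and drug classes are invented for the example.

# Illustrative sketch: k-nearest neighbor classification with unweighted voting.
from collections import Counter
from math import dist

# (age, sodium/potassium ratio) -> drug class; all values are hypothetical
training = [
    ((25, 10.0), "A"), ((47, 14.0), "A"), ((58, 9.0), "B"),
    ((62, 7.5), "B"), ((38, 12.5), "A"), ((70, 6.0), "B"),
]

def knn_classify(new_record, k=3):
    # Find the k records closest to the new record (Euclidean distance),
    # then let them vote with equal weight.
    neighbors = sorted(training, key=lambda rec: dist(rec[0], new_record))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((54, 8.0), k=3))
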
Keywords:
Supervised methods, unsupervised methods, a general methodology for supervised
modeling, training set, test set, validation set, overfitting, underfitting, the bias-variance
tradeoff, model complexity, mean-square error, classification task, drug classification,
distance function, combination function, weighted voting, unweighted voting, stretching
the axes.
----------------------------------------
Chapter Six: Decision Trees
Chapter Six begins with the general description of a decision tree, as a collection of
decision nodes, connected by branches, extending downward from the root node until
terminating in leaf nodes. Beginning at the root node, attributes are tested at the decision
nodes, with each possible outcome resulting in a branch. Each branch then leads either
to another decision node or to a terminating leaf node. Node “purity” is introduced,
leading to a discussion of how one measures “uniformity” or “heterogeneity”. Two of
the main methods for measuring leaf node purity lead to the two leading algorithms for
constructing decision trees, discussed in this chapter, Classification and Regression
Trees (CART), and the C4.5 Algorithm. The CART algorithm is walked-through, using a
very small data set for classifying credit risk, based on savings, assets, and income. The
goodness of the results is assessed using the classification error measure. Next, we turn
to the C4.5 algorithm. Unlike CART, the C4.5 algorithm is not restricted to binary splits.
For categorical attributes, C4.5 by default produces a separate branch for each value of
the categorical attribute. The C4.5 method for measuring node homogeneity is quite
different from CART’s, using the concept of information gain or entropy reduction for
selecting the optimal split. These terms are defined and explained. Then the C4.5
algorithm is walked-through, using the same credit risk data set as above. Information
gain for each candidate split is calculated and compared, so that the best split at each
node may be identified. The resulting decision tree is compared with the CART model.
The generation of decision rules from trees is illustrated. Then, an application and
comparison of the CART and C4.5 algorithms to a real-world large data set is performed,
using Clementine software. Differences between the models are discussed. The
exercises include a challenge to readers to construct CART and C4.5 models by hand,
using the small salary-classification data set provided. The hands-on analysis problems
include generating CART and C4.5 models for the large churn data set.
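
The following sketch illustrates the entropy and information-gain calculation that C4.5 uses to compare candidate splits. The class labels and the candidate split are invented, in the spirit of the small credit-risk example.

# Illustrative sketch: entropy reduction (information gain) for one candidate split.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

parent = ["good", "good", "good", "bad", "bad", "good", "bad", "good"]
# A candidate split partitions the parent records into branches:
branches = [["good", "good", "good", "bad"], ["bad", "good", "bad", "good"]]

weighted_child = sum(len(b) / len(parent) * entropy(b) for b in branches)
information_gain = entropy(parent) - weighted_child    # larger gain = better split
print(round(information_gain, 4))
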
Keywords:
Decision node, branches, root node, leaf node, node purity, node heterogeneity,
classification and regression trees, the C4.5 algorithm, classification error, binary splits,
information gain, entropy reduction, optimal split, decision rules.
----------------------------------------
Chapter Seven: Neural Networks
Chapter Seven begins with a brief comparison of the functions of artificial neurons with
real neurons. The need for input and output encoding is discussed, followed by the use
of neural networks for estimation and prediction. A simple example of a neural network
is presented, so that its topology and structure may be examined. A neural network
consists of a layered, feed-forward, completely-connected network of artificial neurons,
or nodes. The neural network is composed of two or more layers, though most networks
consist of three layers: an input layer, a hidden layer, and an output layer. There may be
more than one hidden layer, though most networks contain only one, which is sufficient
for most purposes. Each connection between nodes has a weight associated with it. The
combination function and the sigmoid activation function are examined, using a small
data set for illustration. The backpropagation algorithm is described, as a method for
minimizing the sum of squared errors. The gradient descent method is explained and
illustrated. Backpropagation rules are discussed, as a method for distributing error
responsibility throughout the network. A walk-through of the backpropagation
algorithm is performed, showing how the weights are adjusted, using a tiny sample data
set. Termination criteria are discussed, as are benefits and drawbacks of using neural
networks. The learning rate and the momentum term are motivated and illustrated.
Sensitivity analysis is explained, for identifying the most influential attributes. Finally,
an application of neural networks to a real-world large data set is carried out, using
Insightful Miner. The resulting network topology and connection weights are
investigated. The hands-on analysis problems include generating neural network models
for the large churn data set.
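
The following sketch illustrates one feed-forward pass through a tiny 2-3-1 network with sigmoid activations, followed by a single gradient-descent adjustment of the output-layer weights. All weights, inputs, the target value, and the learning rate are invented for illustration.

# Illustrative sketch: one feed-forward pass and one output-layer weight update.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.4, 0.2])                                        # encoded input record
W_hidden = np.array([[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]])     # 3 hidden nodes x 2 inputs
b_hidden = np.array([0.1, 0.1, 0.1])
W_out = np.array([0.2, -0.1, 0.4])                              # 1 output node
b_out = 0.05

h = sigmoid(W_hidden @ x + b_hidden)        # hidden-layer outputs
y_hat = sigmoid(W_out @ h + b_out)          # network prediction
target, eta = 1.0, 0.1                      # actual value, learning rate

# Backpropagation rule for the output node (sum-of-squared-errors criterion):
delta_out = (target - y_hat) * y_hat * (1 - y_hat)
W_out = W_out + eta * delta_out * h         # gradient-descent adjustment of output weights
print(y_hat, W_out)
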
Keywords:
Network topology, layered network, feed-forward, completely-connected, input layer,
hidden layer, output layer, connection weights, combination function, activation function,
sigmoid function, backpropagation algorithm, sum of squared errors, gradient descent
method, backpropagation rules, error responsibility, termination criteria, learning rate,
momentum term, sensitivity analysis.
----------------------------------------
Chapter Eight: Hierarchical and K-Means Clustering
Chapter Eight begins with a review of the clustering task and the concept of distance.
Good clustering algorithms seek to construct clusters of records such that the between-cluster
variation is large compared to the within-cluster variation. First, hierarchical
clustering methods are examined. In hierarchical clustering, a treelike cluster structure
(dendrogram) is created through recursive partitioning (divisive methods) or combining
(agglomerative) of existing clusters. Single-linkage, complete-linkage, and average-linkage
methods are discussed. The single-linkage and complete-linkage clustering
algorithms are walked-through, using a small univariate data set. Differences in the
resulting dendrogram structure are discussed. The average-linkage algorithm is shown
to produce the same dendrogram as the complete-linkage algorithm, for this data set,
though not necessarily in general. Next, we turn to the k-means clustering algorithm,
beginning with the definition of the steps involved in the algorithm. Cluster centroids
are defined. The k-means algorithm is walked-through, using a tiny bivariate data set,
showing graphically how the cluster centers are updated. An application of k-means
clustering to the large churn data set is undertaken, using SAS Enterprise Miner. The
resulting clusters are profiled. Finally, the methodology of using cluster membership for
further analysis downstream is illustrated, with the clusters identified by SAS Enterprise
Miner helping to predict churn. The exercises include challenges to readers to construct
single-linkage, complete-linkage, and k-means clustering solutions for small univariate
and bivariate data sets. The hands-on analysis problems include generating k-means
clusters using the cereals data set, and applying these clusters to help predict nutrition
rating.
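
The following sketch illustrates the two alternating steps of the k-means algorithm, assignment of records to the nearest centroid and recomputation of the centroids, on an invented bivariate data set with k = 2.

# Illustrative sketch: the k-means assignment and update steps.
from math import dist

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]       # invented bivariate records
centroids = [(1, 1), (9, 9)]                                    # k = 2, arbitrary starting centers

for _ in range(10):                        # a few iterations suffice for this toy data
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    # Recompute each centroid as the mean of its members
    # (no empty-cluster handling; acceptable for this toy example).
    centroids = [
        tuple(sum(coord) / len(members) for coord in zip(*members))
        for members in clusters.values()
    ]

print(centroids)
print(clusters)
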
Keywords:
Clustering task, between-cluster variation, within-cluster variation, hierarchical
clustering, dendrogram, divisive methods, agglomerative methods, single-linkage,
complete-linkage, average-linkage, cluster centroids, cluster profiles.
----------------------------------------
Chapter Nine: Kohonen Networks
Chapter Nine begins with a discussion of self-organizing maps (SOMs), a special class
of artificial neural networks, whose goal is to convert a complex high-dimensional input
signal into a simpler low-dimensional discrete map. SOMs are based on competitive
learning, where the output nodes compete amongst themselves to be the winning node,
becoming selectively tuned to various input patterns. The topology of SOMs is
illustrated. SOMs exhibit three characteristic processes: competition, cooperation, and
adaptation. In competition, the output nodes compete with each other to produce the
best value for a particular scoring function, most commonly, the Euclidean distance.
Cooperation refers to all the nodes in this neighborhood sharing in the “reward” earned
by the winning nodes. Adaptation refers to the weight adjustment undergone by the
connections to the winning nodes and their neighbors. Kohonen networks are then
defined as a special class of SOMs exhibiting Kohonen learning. The Kohonen network
algorithm is defined. A walk-through of the Kohonen network algorithm is provided,
using a small data set. Cluster validity is discussed. An application of Kohonen network
clustering is examined, using the churn data set. The topology of the network is
compared to the Clementine results, showing how similar clusters are closer to each
other. Detailed cluster profiles are obtained. The use of cluster membership for
downstream classification is demonstrated. The hands-on analysis problems include
applying Kohonen network clustering to a real-world large data set, the Adult data set,
generating cluster profiles, and using cluster membership to classify the target variable
income using CART and C4.5 models.
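
The following sketch illustrates the competition, cooperation, and adaptation steps for a single input record on an invented 3 x 3 Kohonen map; the grid size, weights, learning rate, and neighborhood radius are assumptions made for the example.

# Illustrative sketch: one update of a small self-organizing map.
import numpy as np

rng = np.random.default_rng(0)
grid = rng.random((3, 3, 2))               # 3x3 output map, 2 input attributes
x = np.array([0.8, 0.2])                   # one (normalized) input record
eta, radius = 0.3, 1                       # learning rate, neighborhood radius

# Competition: the winning node minimizes Euclidean distance to the input
dists = np.linalg.norm(grid - x, axis=2)
wi, wj = np.unravel_index(np.argmin(dists), dists.shape)

# Cooperation and adaptation: the winner and its neighbors move toward the input
for i in range(3):
    for j in range(3):
        if max(abs(i - wi), abs(j - wj)) <= radius:
            grid[i, j] += eta * (x - grid[i, j])

print((wi, wj))
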
Keywords:
Self-organizing maps, competitive learning, winning node, competition, cooperation,
adaptation, scoring function, Kohonen learning, cluster validity.
----------------------------------------
Chapter Ten: Association Rules
Chapter Ten begins with an introduction to affinity analysis, and a review of the
association task. Two prevalent algorithms for association rule mining are examined in
this chapter, the a priori algorithm, and the generalized rule induction (GRI) algorithm.
The basic concepts and terms of association rule mining are introduced, in the context of
market basket analysis, using a roadside vegetable stand example. Two different data
representations for market basket analysis are shown, transactional data format, and
tabular data format. Support, confidence, itemsets, frequency, and the a priori property
are defined, along with the two-step process for mining association rules. A walk-through
of the a priori algorithm is provided, using the roadside vegetable stand data.
First, the generation of frequent itemsets is shown, using the joining step and the pruning
step. Then the method of generating association rules from the set of frequent itemsets is
given, and the resulting association rules for the purchase of vegetables are illustrated.
These rules are ranked by support × confidence and compared to the rules generated by
Clementine. Then, the extension of association rule mining from flag attributes to
general categorical attributes is discussed, and an example given from a large data set.
Next, the information-theoretic GRI algorithm is investigated. The J-measure, used by
GRI to assess the interestingness of a candidate association rule, is examined, and the
behavior of the J-measure is described. The GRI algorithm is then applied to a real-world
data set, showing that it can be used for numerical as well as categorical variables.
Readers are advised about when not to use association rules. Various rule choice criteria
are compared for the a priori algorithm, including the confidence difference and
confidence ratio criteria. Discussions take place regarding whether association rule
mining represents supervised or unsupervised data mining, and of the difference between
local patterns and global models. The exercises include challenges to readers to generate
frequent itemsets and to mine association rules, by hand, from a well-known small data
set. The hands-on analysis problems include mining association rules from the large
churn data set, using both a priori and GRI, and comparing the results with findings
from earlier in the text.
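
The following sketch illustrates the support and confidence calculations for one candidate rule over a few invented transactions in transactional data format, in the spirit of the roadside vegetable stand example.

# Illustrative sketch: support and confidence for the rule "if broccoli then squash".
transactions = [
    {"broccoli", "squash", "corn"},
    {"broccoli", "squash"},
    {"corn", "tomatoes"},
    {"broccoli", "corn"},
    {"squash", "tomatoes"},
]

antecedent, consequent = {"broccoli"}, {"squash"}
n = len(transactions)

support_count = sum(1 for t in transactions if antecedent | consequent <= t)
antecedent_count = sum(1 for t in transactions if antecedent <= t)

support = support_count / n                        # proportion containing both itemsets
confidence = support_count / antecedent_count      # of those with the antecedent, how many have the consequent
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
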
Keywords:
Affinity analysis, market basket analysis, the a priori algorithm, the generalized rule
induction algorithm, transactional and tabular data formats, support, confidence,
itemsets, frequent itemsets, the a priori property, joining step, pruning step, the J-measure,
confidence difference criterion, confidence ratio criterion, local patterns, global
models.
----------------------------------------
Chapter Eleven: Model Evaluation Techniques
Chapter Eleven is structured so that the various model evaluation techniques are
examined, classified by data mining task. First, model evaluation techniques for the
description task are briefly discussed, including transparency and the minimum
descriptive length principle. Then model evaluation techniques for the estimation and
prediction tasks are outlined, including estimation error (or residual), mean square error,
and the standard error of the estimate. The heart of the chapter lies in the discussion of
model evaluation techniques for the classification task. The following evaluative
concepts, methods, and tools are discussed in the context of the C5.0 model for
classifying income from Chapter Six: error rate, false positives, false negatives, the
confusion matrix, error cost adjustment, lift, lift charts, and gains charts. The relationship
between these terms and the terminology of hypothesis testing, such as Type I and Type II
errors, is discussed. Misclassification cost adjustment is investigated, in order
to reflect real-world concerns, and readers are shown how to adjust misclassification
costs using Clementine. Decision cost-benefit analysis is illustrated, showing how
analysts can provide model comparison in terms of anticipated profit or loss. An
example of estimated cost savings using this cost-benefit analysis is shown. Lift charts
and gains charts are defined and illustrated, including a combined lift chart, which shows
that different models are preferable over different regions of the chart. Emphasis is placed
on interweaving model evaluation with model building. Stress is laid upon the importance of achieving a
confluence of results from a set of different models, thereby enhancing the analyst’s
confidence in the findings. The hands-on analysis problems include using all of the
techniques and methods learned in this chapter to select the best model from a set of
candidate models: two CART models, two C4.5 models, and a neural network model.
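
The following sketch illustrates the confusion-matrix quantities discussed above, computed from invented actual and predicted class labels; the lift figure shown applies to the set of records the model predicts as positive.

# Illustrative sketch: confusion-matrix counts, error rate, and lift.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # invented true labels (1 = positive)
predicted = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]   # invented model predictions

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))   # false positives (Type I errors)
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))   # false negatives (Type II errors)

error_rate = (fp + fn) / len(actual)
# Lift among predicted positives: proportion of true positives relative to the overall base rate
lift = (tp / (tp + fp)) / (sum(actual) / len(actual))
print(f"error rate = {error_rate:.2f}, lift = {lift:.2f}")
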
Keywords:
Minimum descriptive length principle, error rate, false positives, false negatives, the
confusion matrix, error (misclassification) cost adjustment, lift, lift charts, gains charts,
Type I error, Type II error, decision cost-benefit analysis, confluence of results.
----------------------------------------