Discovering Knowledge in Data: An Introduction to Data Mining
By Daniel T. Larose, Ph.D.
Chapter Summaries and Keywords
Preface.
The preface begins by discussing why Discovering Knowledge in Data: An Introduction
to Data Mining (DKD) is needed. Because of the powerful data mining software
platforms currently available, a strong caveat is given against glib application of data
mining methods and techniques. In other words, data mining is easy to do badly. The
best way to avoid these costly errors, which stem from a blind black-box approach to data
mining, is to instead apply a “white-box” methodology, which emphasizes an
understanding of the algorithmic and statistical model structures underlying the software.
DKD applies this white-box approach by (1) walking the reader through the operations
and nuances of the various algorithms, using small sample data sets, so that the reader
gets a true appreciation of what is really going on inside the algorithm, (2) providing
examples of the application of the various algorithms on actual large data sets, (3)
supplying chapter exercises, which allow readers to assess their depth of understanding of
the material, as well as have a little fun playing with numbers and data, and (4) providing
the reader with hands-on analysis problems, representing an opportunity for the reader to
apply his or her newly-acquired data mining expertise to solving real problems using
large data sets. DKD presents data mining as a well-structured standard process, namely,
the Cross-Industry Standard Process for Data Mining (CRISP-DM). DKD emphasizes a
graphical approach to data analysis, stressing in particular exploratory data analysis.
DKD naturally fits the role of textbook for an introductory course in data mining.
Instructors may appreciate (1) the presentation of data mining as a process, (2) the
“white-box” approach, emphasizing an understanding of the underlying algorithmic
structures, (3) the graphical approach, emphasizing exploratory data analysis, and (4) the
logical presentation, flowing naturally from the CRISP-DM standard process and the set
of data mining tasks. DKD is appropriate for advanced undergraduate or graduate-level
courses. No computer programming or database expertise is required.
Keywords:
Algorithm walk-throughs, hands-on analysis problems, chapter exercises, “white-box”
approach, data mining as a process, graphical and exploratory approach.
----------------------------------------
Chapter One: An Introduction to Data Mining
Chapter One begins by defining data mining and investigating why it is needed and how
widespread it is. One of the strong themes of DKD is the need for human
direction and supervision of data mining at all levels. The CRISP-DM standard process is
described, so that data miners within a corporate or research enterprise do not suffer
from isolation. The six phases of the standard process are described in detail, including
(1) the business or research understanding phase, (2) the data understanding phase, (3)
the data preparation phase, (4) the modeling phase, (5) the evaluation phase, and (6) the
deployment phase. A case study from DaimlerChrysler illustrating these phases is
provided. Common fallacies of data mining are debunked. The primary tasks of data
mining are described in detail, including description, estimation, prediction,
classification, clustering, and association. Finally, four case studies are examined in
relation to the phases of the standard process and the set of data mining tasks.
Keywords:
business or research understanding phase, data understanding phase, data preparation
phase, modeling phase, evaluation phase, and deployment phase, description, estimation,
prediction, classification, clustering, and association.
----------------------------------------
Chapter Two: Data Preprocessing
Chapter Two begins by explaining why data preprocessing is needed. Much of the raw
data contained in databases is unpreprocessed, incomplete, and noisy. Therefore, in
order to be useful for data mining purposes, the databases need to undergo
preprocessing, in the form of data cleaning and data transformation. The overriding
objective is to minimize GIGO, to minimize the Garbage that gets Into our model, so that
we can minimize the amount of Garbage that our models give Out. Data preparation
alone accounts for 60% of all the time and effort for the entire data mining process.
Much of the data contains field values that have expired, are no longer relevant, or are
simply missing. Examples of data cleaning are provided. Different methods of handling
missing values are provided, including replacing missing field values with user-defined
constants, means, or random draws from the variable distribution. Identifying
misclassifications is discussed, as well as graphical methods for identifying outliers.
Methods of transforming data are provided, including min-max normalization and z-score
standardization. Finally, numerical methods for identifying outliers are examined,
including the IQR method. The hands-on analysis problems include challenging the
reader to preprocess the large churn data set, so that it will be ready for further analysis
downstream.
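
As a brief illustration of the two transformations named above, the following sketch applies min-max normalization and z-score standardization to a handful of invented field values; the numbers are hypothetical and are not drawn from the churn data set.

# Illustrative sketch: min-max normalization and z-score standardization
# applied to a single numeric field (values below are invented).
from statistics import mean, stdev

values = [75, 110, 133, 244, 98, 187]                 # hypothetical field values

mn, mx = min(values), max(values)
min_max = [(x - mn) / (mx - mn) for x in values]      # rescales to the range [0, 1]

mu, sigma = mean(values), stdev(values)               # sample standard deviation (n - 1)
z_scores = [(x - mu) / sigma for x in values]         # centered at 0, spread about 1

print(min_max)
print(z_scores)
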
Keywords:
Data preprocessing, noisy data, data cleaning, data transformation, minimize GIGO, data
preparation, missing data, misclassifications, identifying outliers, min-max normalization,
z-score standardization.
----------------------------------------
Chapter Three: Exploratory Data Analysis
Chapter Three begins by discussing the difference between hypothesis testing, where an
a priori hypothesis is assessed, and exploratory data analysis. Exploratory data
analysis (EDA), or graphical data analysis, allows the analyst to (1) delve into the data
set, (2) examine the inter-relationships among the attributes, (3) identify interesting
subsets of the observations, and (4) develop an initial idea of possible associations
between the attributes and the target variable, if any. Extensive screen-shots of EDA
carried out on the churn data set are provided. Graphical and numerical methods for
handling correlated variables are examined. Next, exploratory methods for categorical
variables are investigated. Bar charts with overlays, normalized bar charts, and subsetted
bar charts are illustrated, along with crosstabulations and web graphs for investigating
the relationship between two categorical attributes. Throughout the chapter, nuggets of
information are drawn from the EDA that will help us to predict churn, that is, predict
who will leave the company’s service. EDA can help to uncover anomalous fields.
Next, exploratory methods for numerical variables are examined, including summary
statistics, correlations, histograms, normalized histograms, and histograms with overlays.
Then, exploratory methods for multivariate relationships are illustrated, including scatter
plots and 3D scatter plots. Methods for selecting interesting subsets of records are
discussed, along with binning to increase efficiency. The hands-on analysis problems
include challenging the reader to construct a comprehensive exploratory data analysis of
a large real-world data set, the adult data set, and reporting on the salient results.
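
As a small illustration of one of the exploratory tools mentioned above, the following sketch builds a cross-tabulation with row-normalized proportions, the tabular counterpart of a normalized bar chart. The column names and records are assumptions invented for the example, not fields from the churn data set.

# Illustrative sketch: cross-tabulation of two categorical attributes.
import pandas as pd

# Invented records standing in for two categorical fields from a churn-style data set
df = pd.DataFrame({
    "IntlPlan": ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "Churn":    ["yes", "no", "no", "yes", "no", "yes", "no", "no"],
})

counts = pd.crosstab(df["IntlPlan"], df["Churn"])                          # raw counts
proportions = pd.crosstab(df["IntlPlan"], df["Churn"], normalize="index")  # row proportions

print(counts)
print(proportions.round(3))
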
Keywords:
Graphical data analysis, inter-relationships among variables, handling correlated
variables, categorical variables, normalized bar charts, crosstabulations, web graphs,
predicting churn, anomalous fields, summary statistics, histograms, scatter plots, binning.
----------------------------------------
Chapter Four: Statistical Approaches to Estimation and Prediction
If estimation and prediction are considered to be data mining tasks, then statistical
analysts have been performing data mining for over a century. Chapter Four examines
some of the more widespread and traditional methods of estimation and prediction,
drawn from the world of statistical analysis. We begin by examining univariate
methods, statistical estimation and prediction methods that analyze one variable at a
time. Measures of center and location are defined, such as the mean, median, and mode,
followed by measures of spread or variability, such as the range and standard deviation.
The process of statistical inference is described, using sample statistics to estimate the
unknown value of population parameters. Sampling error is defined, leading to methods
to measure the confidence of our estimates, such as the margin of error. The univariate
methods of point estimation and confidence interval estimation are discussed. Next we
consider simple linear regression, where the relationship between two numerical
variables is investigated. Standard errors are discussed, along with the meaning of the
regression coefficients. The dangers of extrapolation are illustrated, as are confidence
intervals and prediction intervals in the context of regression. Finally, we examine
multiple regression, where the relationship between a response variable and a set of
predictor variables is modeled linearly. Methods are shown for verifying regression
model assumptions. The hands-on analysis problems include generating simple linear
regression and multiple regression models for predicting nutrition rating for breakfast
cereals using the cereals data set.
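
The following sketch illustrates simple linear regression by least squares, in the spirit of the cereals exercise. The sugar and rating values are invented stand-ins, not records from the cereals data set.

# Illustrative sketch: simple linear regression by least squares.
import numpy as np

x = np.array([6.0, 8.0, 5.0, 0.0, 12.0, 3.0])         # e.g., grams of sugar (invented)
y = np.array([40.4, 34.0, 45.8, 68.4, 27.8, 59.4])    # e.g., nutrition rating (invented)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept

y_hat = b0 + b1 * x                                    # fitted values
residuals = y - y_hat
s = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))     # standard error of the estimate

print(f"rating_hat = {b0:.2f} + {b1:.2f} * sugars, s = {s:.2f}")
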
Keywords:
Estimation, prediction, univariate methods, measures of center, mean, median, mode,
measures of spread, range, standard deviation, statistical inference, sample statistics,
population parameters, sampling error, margin of error, point estimation, confidence
interval estimation, simple linear regression, prediction intervals, multiple regression,
response variable, predictor variables, model assumptions.
----------------------------------------
Chapter Five: The K-Nearest Neighbor Algorithm
Chapter Five begins with a discussion of the differences between supervised and
unsupervised methods. In unsupervised methods, no target variable is identified as such.
Most data mining methods are supervised methods, however, meaning that (a) there is a
particular pre-specified target variable, and (b) the algorithm is given many examples
where the value of the target variable is provided, so that the algorithm may learn which
values of the target variable are associated with which values of the predictor variables.
A general methodology for supervised modeling is provided, for building and evaluating
a data mining model. The training data set, test data set, and validation data set are
discussed. The tension between model overfitting and underfitting is illustrated
graphically, as is the bias-variance tradeoff. High-complexity models are associated with
high accuracy on the training data but high variability (variance) on new data. The
mean-square error is introduced as a combination of bias and variance. The general
classification task is recapitulated. The k-nearest neighbor algorithm is introduced, in the
context of a patient-drug classification problem. Voting with different values of k is shown
to sometimes lead to different
results. The distance function, or distance metric, is defined, with Euclidean distance
being typically chosen for this algorithm. The combination function is defined, for both
simple unweighted voting and weighted voting. Stretching the axes is shown as a method
for quantifying the relevance of various attributes. Database considerations, such as
balancing, are discussed. Finally, k-nearest neighbor methods for estimation and
prediction are examined, along with methods for choosing the best value for k.
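
The following sketch illustrates the core of the k-nearest neighbor classifier described above: Euclidean distance as the distance function and simple unweighted voting as the combination function. The patient records and drug classes are invented for the example.

# Illustrative sketch: k-nearest neighbor classification with unweighted voting.
from collections import Counter
from math import dist

# (age, sodium/potassium ratio) -> drug class; all values are hypothetical
training = [
    ((25, 10.0), "A"), ((47, 14.0), "A"), ((58, 9.0), "B"),
    ((62, 7.5), "B"), ((38, 12.5), "A"), ((70, 6.0), "B"),
]

def knn_classify(new_record, k=3):
    # Find the k records closest to the new record (Euclidean distance),
    # then let them vote with equal weight.
    neighbors = sorted(training, key=lambda rec: dist(rec[0], new_record))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((54, 8.0), k=3))
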
Keywords:
Supervised methods, unsupervised methods, a general methodology for supervised
modeling, training set, test set, validation set, overfitting, underfitting, the bias-variance
tradeoff, model complexity, mean-square error, classification task, drug classification,
distance function, combination function, weighted voting, unweighted voting, stretching
the axes.
----------------------------------------
Chapter Six: Decision Trees
Chapter Six begins with the general description of a decision tree, as a collection of
decision nodes, connected by branches, extending downward from the root node until
terminating in leaf nodes. Beginning at the root node, attributes are tested at the decision
nodes, with each possible outcome resulting in a branch. Each branch then leads either
to another decision node or to a terminating leaf node. Node “purity” is introduced,
leading to a discussion of how one measures “uniformity” or “heterogeneity”. Two of
the main methods for measuring leaf node purity lead to the two leading algorithms for
constructing decision trees, discussed in this chapter, Classification and Regression
Trees (CART), and the C4.5 Algorithm. The CART algorithm is walked-through, using a
very small data set for classifying credit risk, based on savings, assets, and income. The
goodness of the results is assessed using the classification error measure. Next, we turn
to the C4.5 algorithm. Unlike CART, the C4.5 algorithm is not restricted to binary splits.
For categorical attributes, C4.5 by default produces a separate branch for each value of
the categorical attribute. The C4.5 method for measuring node homogeneity is quite
different from CART’s, using the concept of information gain or entropy reduction for
selecting the optimal split. These terms are defined and explained. Then the C4.5
algorithm is walked-through, using the same credit risk data set as above. Information
gain for each candidate split is calculated and compared, so that the best split at each
node may be identified. The resulting decision tree is compared with the CART model.
The generation of decision rules from trees is illustrated. Then, an application and
comparison of the CART and C4.5 algorithms to a real-world large data set is performed,
using Clementine software. Differences between the models are discussed. The
exercises include a challenge to readers to construct CART and C4.5 models by hand,
using the small salary-classification data set provided. The hands-on analysis problems
include generating CART and C4.5 models for the large churn data set.
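
The following sketch illustrates the entropy and information-gain calculation that C4.5 uses to compare candidate splits. The class labels and the candidate split are invented, in the spirit of the small credit-risk example.

# Illustrative sketch: entropy reduction (information gain) for one candidate split.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

parent = ["good", "good", "good", "bad", "bad", "good", "bad", "good"]
# A candidate split partitions the parent records into branches:
branches = [["good", "good", "good", "bad"], ["bad", "good", "bad", "good"]]

weighted_child = sum(len(b) / len(parent) * entropy(b) for b in branches)
information_gain = entropy(parent) - weighted_child    # larger gain = better split
print(round(information_gain, 4))
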
Keywords:
Decision node, branches, root node, leaf node, node purity, node heterogeneity,
classification and regression trees, the C4.5 algorithm, classification error, binary splits,
information gain, entropy reduction, optimal split, decision rules.
----------------------------------------
Chapter Seven: Neural Networks
Chapter Seven begins with a brief comparison of the functions of artificial neurons with
real neurons. The need for input and output encoding is discussed, followed by the use
of neural networks for estimation and prediction. A simple example of a neural network
is presented, so that its topology and structure may be examined. A neural network
consists of a layered, feed-forward, completely-connected network of artificial neurons,
or nodes. The neural network is composed of two or more layers, though most networks
consist of three layers: an input layer, a hidden layer, and an output layer. There may be
more than one hidden layer, though most networks contain only one, which is sufficient
for most purposes. Each connection between nodes has a weight associated with it. The
combination function and the sigmoid activation function are examined, using a small
data set for illustration. The backpropagation algorithm is described, as a method for
minimizing the sum of squared errors. The gradient descent method is explained and
illustrated. Backpropagation rules are discussed, as a method for distributing error
responsibility throughout the network. A walk-through of the backpropagation
algorithm is performed, showing how the weights are adjusted, using a tiny sample data
set. Termination criteria are discussed, as are benefits and drawbacks of using neural
networks. The learning rate and the momentum term are motivated and illustrated.
Sensitivity analysis is explained, for identifying the most influential attributes. Finally,
an application of neural networks to a real-world large data set is carried out, using
Insightful Miner. The resulting network topology and connection weights are
investigated. The hands-on analysis problems include generating neural network models
for the large churn data set.
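
The following sketch illustrates one feed-forward pass through a tiny 2-3-1 network with sigmoid activations, followed by a single gradient-descent adjustment of the output-layer weights. All weights, inputs, the target value, and the learning rate are invented for illustration.

# Illustrative sketch: one feed-forward pass and one output-layer weight update.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.4, 0.2])                                        # encoded input record
W_hidden = np.array([[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]])     # 3 hidden nodes x 2 inputs
b_hidden = np.array([0.1, 0.1, 0.1])
W_out = np.array([0.2, -0.1, 0.4])                              # 1 output node
b_out = 0.05

h = sigmoid(W_hidden @ x + b_hidden)        # hidden-layer outputs
y_hat = sigmoid(W_out @ h + b_out)          # network prediction
target, eta = 1.0, 0.1                      # actual value, learning rate

# Backpropagation rule for the output node (sum-of-squared-errors criterion):
delta_out = (target - y_hat) * y_hat * (1 - y_hat)
W_out = W_out + eta * delta_out * h         # gradient-descent adjustment of output weights
print(y_hat, W_out)
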
Keywords:
Network topology, layered network, feed-forward, completely-connected, input layer,
hidden layer, output layer, connection weights, combination function, activation function,
sigmoid function, backpropagation algorithm, sum of squared errors, gradient descent
method, backpropagation rules, error responsibility, termination criteria, learning rate,
momentum term, sensitivity analysis.
----------------------------------------
Chapter Eight: Hierarchical and K-Means Clustering
Chapter Eight begins with a review of the clustering task and the concept of distance.
Good clustering algorithms seek to construct clusters of records such that the between-cluster
variation is large compared to the within-cluster variation. First, hierarchical
clustering methods are examined. In hierarchical clustering, a treelike cluster structure
(dendrogram) is created through recursive partitioning (divisive methods) or combining
(agglomerative) of existing clusters. Single-linkage, complete-linkage, and average-linkage
methods are discussed. The single-linkage and complete-linkage clustering
algorithms are walked-through, using a small univariate data set. Differences in the
resulting dendrogram structure are discussed. The average-linkage algorithm is shown
to produce the same dendrogram as the complete-linkage algorithm, for this data set,
though not necessarily in general. Next, we turn to the k-means clustering algorithm,
beginning with the definition of the steps involved in the algorithm. Cluster centroids
are defined. The k-means algorithm is walked-through, using a tiny bivariate data set,
showing graphically how the cluster centers are updated. An application of k-means
clustering to the large churn data set is undertaken, using SAS Enterprise Miner. The
resulting clusters are profiled. Finally, the methodology of using cluster membership for
further analysis downstream is illustrated, with the clusters identified by SAS Enterprise
Miner helping to predict churn. The exercises include challenges to readers to construct
single-linkage, complete-linkage, and k-means clustering solutions for small univariate
and bivariate data sets. The hands-on analysis problems include generating k-means
clusters using the cereals data set, and applying these clusters to help predict nutrition
rating.
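
The following sketch illustrates the two alternating steps of the k-means algorithm, assignment of records to the nearest centroid and recomputation of the centroids, on an invented bivariate data set with k = 2.

# Illustrative sketch: the k-means assignment and update steps.
from math import dist

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]       # invented bivariate records
centroids = [(1, 1), (9, 9)]                                    # k = 2, arbitrary starting centers

for _ in range(10):                        # a few iterations suffice for this toy data
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    # Recompute each centroid as the mean of its members
    # (no empty-cluster handling; acceptable for this toy example).
    centroids = [
        tuple(sum(coord) / len(members) for coord in zip(*members))
        for members in clusters.values()
    ]

print(centroids)
print(clusters)
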
Keywords:
Clustering task, between-cluster variation, within-cluster variation, hierarchical
clustering, dendrogram, divisive methods, agglomerative methods, single-linkage,
complete-linkage, average-linkage, cluster centroids, cluster profiles.
----------------------------------------
Chapter Nine: Kohonen Networks
Chapter Nine begins with a discussion of self-organizing maps (SOMs), a special class
of artificial neural networks, whose goal is to convert a complex high-dimensional input
signal into a simpler low-dimensional discrete map. SOMs are based on competitive
learning, where the output nodes compete amongst themselves to be the winning node,
becoming selectively tuned to various input patterns. The topology of SOMs is
illustrated. SOMs exhibit three characteristic processes: competition, cooperation, and
adaptation. In competition, the output nodes compete with each other to produce the
best value for a particular scoring function, most commonly, the Euclidean distance.
Cooperation refers to all the nodes in this neighborhood sharing in the “reward” earned
by the winning nodes. Adaptation refers to the weight adjustment undergone by the
connections to the winning nodes and their neighbors. Kohonen networks are then
defined as a special class of SOMs exhibiting Kohonen learning. The Kohonen network
algorithm is defined. A walk-through of the Kohonen network algorithm is provided,
using a small data set. Cluster validity is discussed. An application of Kohonen network
clustering is examined, using the churn data set. The topology of the network is
compared to the Clementine results, showing how similar clusters are closer to each
other. Detailed cluster profiles are obtained. The use of cluster membership for
downstream classification is demonstrated. The hands-on analysis problems include
applying Kohonen network clustering to a real-world large data set, the Adult data set,
generating cluster profiles, and using cluster membership to classify the target variable
income using CART and C4.5 models.
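
The following sketch illustrates the competition, cooperation, and adaptation steps for a single input record on an invented 3 x 3 Kohonen map; the grid size, weights, learning rate, and neighborhood radius are assumptions made for the example.

# Illustrative sketch: one update of a small self-organizing map.
import numpy as np

rng = np.random.default_rng(0)
grid = rng.random((3, 3, 2))               # 3x3 output map, 2 input attributes
x = np.array([0.8, 0.2])                   # one (normalized) input record
eta, radius = 0.3, 1                       # learning rate, neighborhood radius

# Competition: the winning node minimizes Euclidean distance to the input
dists = np.linalg.norm(grid - x, axis=2)
wi, wj = np.unravel_index(np.argmin(dists), dists.shape)

# Cooperation and adaptation: the winner and its neighbors move toward the input
for i in range(3):
    for j in range(3):
        if max(abs(i - wi), abs(j - wj)) <= radius:
            grid[i, j] += eta * (x - grid[i, j])

print((wi, wj))
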
Keywords:
Self-organizing maps, competitive learning, winning node, competition, cooperation,
adaptation, scoring function, Kohonen learning, cluster validity.
----------------------------------------
Chapter Ten: Association Rules
Chapter Ten begins with an introduction to affinity analysis, and a review of the
association task. Two prevalent algorithms for association rule mining are examined in
this chapter, the a priori algorithm, and the generalized rule induction (GRI) algorithm.
The basic concepts and terms of association rule mining are introduced, in the context of
market basket analysis, using a roadside vegetable stand example. Two different data
representations for market basket analysis are shown, transactional data format, and
tabular data format. Support, confidence, itemsets, frequency, and the a priori property
are defined, along with the two-step process for mining association rules. A walk-through
of the a priori algorithm is provided, using the roadside vegetable stand data.
First, the generation of frequent itemsets is shown, using the joining step and the pruning
step. Then the method of generating association rules from the set of frequent itemsets is
given, and the resulting association rules for the purchase of vegetables are illustrated.
These rules are ranked by support × confidence and compared to the rules generated by
Clementine. Then, the extension of association rule mining from flag attributes to
general categorical attributes is discussed, and an example given from a large data set.
Next, the information-theoretic GRI algorithm is investigated. The J-measure, used by
GRI to assess the interestingness of a candidate association rule, is examined, and the
behavior of the J-measure is described. The GRI algorithm is then applied to a real-world
data set, showing that it can be used for numerical as well as categorical variables.
Readers are advised about when not to use association rules. Various rule choice criteria
are compared for the a priori algorithm, including the confidence difference and
confidence ratio criteria. Discussions take place regarding whether association rule
mining represents supervised or unsupervised data mining, and of the difference between
local patterns and global models. The exercises include challenges to readers to generate
frequent itemsets and to mine association rules, by hand, from a well-known small data
set. The hands-on analysis problems include mining association rules from the large
churn data set, using both a priori and GRI, and comparing the results with findings
from earlier in the text.
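
The following sketch illustrates the support and confidence calculations for one candidate rule over a few invented transactions in transactional data format, in the spirit of the roadside vegetable stand example.

# Illustrative sketch: support and confidence for the rule "if broccoli then squash".
transactions = [
    {"broccoli", "squash", "corn"},
    {"broccoli", "squash"},
    {"corn", "tomatoes"},
    {"broccoli", "corn"},
    {"squash", "tomatoes"},
]

antecedent, consequent = {"broccoli"}, {"squash"}
n = len(transactions)

support_count = sum(1 for t in transactions if antecedent | consequent <= t)
antecedent_count = sum(1 for t in transactions if antecedent <= t)

support = support_count / n                        # proportion containing both itemsets
confidence = support_count / antecedent_count      # of those with the antecedent, how many have the consequent
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
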
Keywords:
Affinity analysis, market basket analysis, the a priori algorithm, the generalized rule
induction algorithm, transactional and tabular data formats, support, confidence,
itemsets, frequent itemsets, the a priori property, joining step, pruning step, the J-measure,
confidence difference criterion, confidence ratio criterion, local patterns, global
models.
----------------------------------------
Chapter Eleven: Model Evaluation Techniques
Chapter Eleven is structured so that the various model evaluation techniques are
examined, classified by data mining task. First, model evaluation techniques for the
description task are briefly discussed, including transparency and the minimum
descriptive length principle. Then model evaluation techniques for the estimation and
prediction tasks are outlined, including estimation error (or residual), mean square error,
and the standard error of the estimate. The heart of the chapter lies in the discussion of
model evaluation techniques for the classification task. The following evaluative
concepts, methods, and tools are discussed in the context of the C5.0 model for
classifying income from Chapter Six: error rate, false positives, false negatives, the
confusion matrix, error cost adjustment, lift, lift charts, and gains charts. The relationship
between these terms and the terminology of hypothesis testing, such as Type I and Type II
errors, is discussed. Misclassification cost adjustment is investigated, in order
to reflect real-world concerns, and readers are shown how to adjust misclassification
costs using Clementine. Decision cost-benefit analysis is illustrated, showing how
analysts can provide model comparison in terms of anticipated profit or loss. An
example of estimated cost savings using this cost-benefit analysis is shown. Lift charts
and gains charts are defined and illustrated, including a combined lift chart, which shows
that different models are preferable over different regions of the chart. Emphasis is placed
on interweaving model evaluation with model building. Stress is laid upon the importance of achieving a
confluence of results from a set of different models, thereby enhancing the analyst’s
confidence in the findings. The hands-on analysis problems include using all of the
techniques and methods learned in this chapter to select the best model from a set of
candidate models: two CART models, two C4.5 models, and a neural network model.
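
The following sketch illustrates the confusion-matrix quantities discussed above, computed from invented actual and predicted class labels; the lift figure shown applies to the set of records the model predicts as positive.

# Illustrative sketch: confusion-matrix counts, error rate, and lift.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # invented true labels (1 = positive)
predicted = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]   # invented model predictions

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))   # false positives (Type I errors)
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))   # false negatives (Type II errors)

error_rate = (fp + fn) / len(actual)
# Lift among predicted positives: proportion of true positives relative to the overall base rate
lift = (tp / (tp + fp)) / (sum(actual) / len(actual))
print(f"error rate = {error_rate:.2f}, lift = {lift:.2f}")
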
Keywords:
Minimum descriptive length principle, error rate, false positives, false negatives, the
confusion matrix, error (misclassification) cost adjustment, lift, lift charts, gains charts,
Type I error, Type II error, decision cost-benefit analysis, confluence of results.
----------------------------------------