Download Chapter 5 Data Mining a Closer Look

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Chapter 5
Data mining : A
Closer Look
Chapter Objectives
 Determine an appropriate data mining
strategy for a specific problem.
 Know about several data mining techniques
and how each technique builds a generalized
model to represent data.
 Understand how a confusion matrix is used
to help evaluate supervised learner models.
Data Warehouse and Data Mining
2
Chapter 5
Chapter Objectives
Understand basic techniques for
evaluating supervised learner models with
numeric output.
Know how measuring lift can be used to
compare the performance of several
competing supervised learner models.
Understand basic techniques for
evaluating unsupervised learner models.
Data Warehouse and Data Mining
3
Chapter 5
Data Mining Strategies
Classification is probably the best understood of
all data mining strategies.
Classification tasks have three common
characteristics.
• Learning is supervised.
• The dependent variable is categorical.
• The emphasis is on building models able to
assign new instances to one of a set of welldefined classes.
Data Warehouse and Data Mining
4
Chapter 5
Data Mining Strategies
• Some example classification tasks include the
following:
•Determine those characteristics that differentiate individuals
who have suffered a heart attack from those who have not.
• Develop a profile of a “successful” person.
• Determine if a credit card purchase is fraudulent.
• Classify a car loan applicant as a good or a poor credit risk.
• Develop a profile to differentiate female and male stroke
victims.
Data Warehouse and Data Mining
5
Chapter 5
Data Mining Strategies
Data Warehouse and Data Mining
6
Chapter 5
Data Mining Strategies
Data Warehouse and Data Mining
7
Chapter 5
Data Mining Strategies
Data Warehouse and Data Mining
8
Chapter 5
Data Mining Strategies
Data Warehouse and Data Mining
9
Chapter 5
Data Mining Strategies
34% are healthy within these
max heart rate range
Data Warehouse and Data Mining
10
Chapter 5
Supervised Data Mining Techniques
Data Warehouse and Data Mining
11
Chapter 5
Supervised Data Mining Techniques
Data Warehouse and Data Mining
12
Chapter 5
Supervised Data Mining Techniques
Data Warehouse and Data Mining
13
Chapter 5
Supervised Data Mining Techniques
Data Warehouse and Data Mining
14
Chapter 5
Supervised Data Mining Techniques
Data Warehouse and Data Mining
15
Chapter 5
Association Rules
Data Warehouse and Data Mining
16
Chapter 5
Clustering Techniques
Data Warehouse and Data Mining
17
Chapter 5
Clustering Techniques
Data Warehouse and Data Mining
18
Chapter 5
Evaluating Performance
Data Warehouse and Data Mining
19
Chapter 5
Evaluating Performance
Data Warehouse and Data Mining
20
Chapter 5
Evaluating Performance
Data Warehouse and Data Mining
21
Chapter 5
Evaluating Performance
Data Warehouse and Data Mining
22
Chapter 5
Evaluating Performance
Data Warehouse and Data Mining
23
Chapter 5
Chapter Summary
Data mining strategies include classification,
estimation, prediction, unsupervised clustering, and market
basket analysis.
Classification and estimation strategies are similar in
that each strategy is employed to build models able to
generalize current outcome.
However, the output of a classification strategy is
categorical, whereas the output of an estimation strategy is
numeric.
Data Warehouse and Data Mining
24
Chapter 5
Chapter Summary
A predictive strategy differs from a classification or
estimation strategy in that it is used to design models for
predicting future outcome rather than current behavior.
Unsupervised clustering strategies are employed to
discover hidden concept structures in data as well as to locate
atypical data instances.
The purpose of market basket analysis is to find
interesting relationships among retail products.
Discovered relationships can be used to design
promotions, arrange shelf or catalog items, or develop crossmarketing strategies.
Data Warehouse and Data Mining
25
Chapter 5
Chapter Summary
A data mining technique applies a data
mining strategy to a set of data.
Data mining techniques are defined by an
algorithm and a knowledge structure.
Common features that distinguish the various
techniques are whether learning is supervised or
unsupervised and whether their output is
categorical or numeric.
Data Warehouse and Data Mining
26
Chapter 5
Chapter Summary
Familiar supervised data mining techniques include decision
tree methods, production rule generators, neural networks, and
statistical methods.
Association rules are a favorite technique for marketing
applications.
Clustering techniques employ some measure of similarity to
group instances into disjoint partitions.
Clustering methods are frequently used to help determine a
best set of input attributes for building supervised learner
models.
Data Warehouse and Data Mining
27
Chapter 5
Chapter Summary
Performance evaluation is probably the most
critical of all the steps in the data mining
process.
Supervised model evaluation is often
performed using a training/test set scenario.
Supervised models with numeric output can
be evaluated by computing average absolute or
average squared error differences between
computed and desired outcome.
Data Warehouse and Data Mining
28
Chapter 5
Chapter Summary
Marketing applications that focus on mass mailings are
interested in developing models for increasing response rates to
promotions.
A marketing application measures the goodness of a model by
its ability to lift response rate thresholds to levels well above
those achieved by naïve (mass) mailing strategies.
Unsupervised models support some measure of cluster quality
that can be used for evaluative purposes.
Supervised learning can also be employed to evaluate the
quality of the clusters formed by an unsupervised model.
Data Warehouse and Data Mining
29
Chapter 5
Key Terms
Association rule. A production rule whose consequent
may contain multiple conditions and attribute
relationships. An output attribute in one association rule
can be an input attribute in other rule.
Classification. A supervised learning strategy where the
output attribute is categorical. The emphasis is on
building models able to assign new instances to one of
a set of well-defined classes.
Confusion matrix. A matrix used to summarize the
results of a supervised classification. Entries along the
main diagonal represent the total number of correct
classifications. Entries other than those on the main
diagonal represent classification errors.
Data Warehouse and Data Mining
30
Chapter 5
Key Terms
Data mining strategy. An outline of an approach for
problem solution.
Data mining technique. One or more algorithms together
with an associated knowledge structure.
Dependent variable. A variable whose value is determined
by a combination of one or more independent variables.
Estimation. A supervised learning strategy where the
output attribute is numeric. Emphasis is on determining
current rather than future outcome.
Data Warehouse and Data Mining
31
Chapter 5
Key Terms
Independent variable. An input attribute used for building
supervised or unsupervised learner models.
Lift. The probability of class Ci given a sample taken from
population P divided by the probability of Ci given the
entire population P.
Lift chart. A graph that displays the performance of a data
mining model as a function of sample size.
Linear regression. A supervised learning technique that
generalizes numeric data as a linear equation. The
equation defines the value of an output attribute as a
linear sum of weighted input attribute values.
Data Warehouse and Data Mining
32
Chapter 5
Key Terms
Market basket analysis. A data mining strategy that
attempts to find interesting relationships among retail
products.
Mean absolute error. For a set of training or test set
instances, the mean absolute error is the average
absolute difference between classifier predicted output
and actual output.
Mean squared error. For a set of training or test set
instances, the mean squared error is the average of the
sum of squared differences between classifier predicted
output and actual output.
Neural network. A set of interconnected nodes designed to
imitate the functioning of the human brain.
Data Warehouse and Data Mining
33
Chapter 5
Key Terms
Outliers. Atypical data instances.
Prediction. A supervised learning strategy designed to
determine future outcome.
Root mean squared error. The square root of the mean
squared error.
Rule Maker. A supervised learner model for generating
production rules from data.
Statistical regression. A supervised learning technique
that generalizes numerical data as a mathematical
equation. The equation defines the value of an output
attribute as a sum of weighted input attribute values.
Data Warehouse and Data Mining
34
Chapter 5
Data Warehouse and Data Mining
35
Chapter 5