Data Mining with Clementine
Girish Punj
Professor of Marketing
School of Business
University of Connecticut
Agenda

How to introduce data mining to students

Why Clementine?

Clementine features and capabilities

A typical data mining class

Useful teaching resources

Questions?
Introduce Data Mining to Students

“Data mining chosen as one of top 10 emerging
technologies...” (MIT Technology Review)

“Data mining expertise is most sought after...”
(Information Week Survey)

Data mining skills are an important part of the “toolkit”
needed by managers in a complex business world

Data Mining for job advancement and as career
insurance during good and bad economic times
Introduce Data Mining to Students
“When I looked at what companies were doing with
analytics I found it had moved from the back room to
the board room…a number of companies weren’t just
using analytics, they were now competing on
analytics -- they had made analytics the central strategy
of their business.”
(Tom Davenport, author of ‘Competing on Analytics’)
“We are drowning in information but starved for
knowledge.”
(John Naisbitt author of ‘Megatrends’)
Applications: Retail

Use data mining to understand
customers’ wants, needs, and
preferences

Based on this information, deliver
timely, personalized promotional
offers
Applications: Insurance

Leverage data and text
mining to speed claims
processing and help
reduce fraud
Applications: Manufacturing

Model historical production
and quality data to reduce
development time and
improve quality of
production processes
Applications: Telecom

Use data mining to identify
appropriate customer
segments for new
marketing initiatives

Predict likelihood of
customer churn and target
those likely to leave with
retention campaigns
Metaphor: Data Mining and Gold Mining
Data Mining and Knowledge Discovery

Data mining is the process of discovery of
interesting, meaningful and actionable patterns
hidden in large amounts of data (Han and Kamber
2006)

Knowledge Discovery (KD) as a more inclusive
term

Knowledge Discovery using a combination of
artificial and human intelligence

Data → Information → Knowledge
Data Mining and Statistics

Data Mining


No hypotheses are
needed
Can find patterns in very
large amounts of data

Uses all the data
available

Terminology used: field,
record, supervised
learning, unsupervised
learning

Statistics

Uses Hypothesis testing

Techniques are not
suitable for large datasets

Relies on sampling

Terminology used:
variable, observation,
analysis of dependence,
analysis of
interdependence
Deal with Numerophobia

Emphasize Differences between Statistics and Data
Mining to advantage (no probability distributions)

Use a math primer for numerically challenged
students
http://www.youtube.com/watch?v=nRKzseCLja8
Introduce Software to Students

Clementine 12.0:

Student Version (Clementine GradPack) is of
enterprise strength

Student License extends for about eight months
beyond course completion date

Directly address cost concerns by discussing value
of “investment”
Who was Clementine?

Daughter of a miner during the 1849
California Gold Rush who developed
a reputation…

“In a cavern, in a canyon,
Excavating for a mine
Dwelt a miner, forty niner,
And his daughter Clementine…”
http://www.empire.k12.ca.us/capistrano/mike/capmusic/the_wild_west/gold_rush/clemtine.mid
Introduce Software to Students

Visual approach makes model building an art form

Concept of “data flow” enables building of multiple
models

Point-and-click model building (no manual coding)

Comprehensive portfolio of models for the Business
Analyst as well as the Technical Expert
Clementine Basics: Building a Model
Clementine Basics: Select a Data Source
Clementine Basics: Select a Data File
Clementine Basics: Read a Data File
Clementine Basics: Select Fields
Clementine Basics: Define Field Types
Clementine Basics: Visualize Data

Create tables and charts for means, ranges, and
correlations of all variables
Clementine Basics: Visualize Data

Examine associations among variables
using visual displays
Clementine Basics:
Select Target and Predictors
Clementine Basics: Execute Model
Clementine Basics: Review Model Results
Building Models in Clementine
Models:

Up sell/Cross sell: Creating business rules for Up sell &
Cross sell

Customer Churn: Identify and target likely churn candidates,
and create retention offerings to decrease their likelihood
to churn

Propensity to respond/purchase: Develop models on desired
purchase behavior, and target candidates that are most likely
to respond
A Typical Clementine Model
Modeling Approaches

Can use auto “c.h.d”
settings (beginning user)

But can also use expert
capabilities (advanced user)
Data Mining Procedures

Estimation

Prediction

Classification

Clustering

Affinity/Association
Specific Methodologies Available
 Estimation & Prediction:
- Neural networks
 Classification:
- Decision trees (2 types)
Specific Methodologies Available
 Clustering:
- K-means
- Kohonen networks
 Affinity/Association:
- Association rules (2 types)
Positioning the Course
[Diagram labels: Theory and Concepts; Business Applications;
Clementine Models; Focus of the Course]
A Typical Class

Discuss business applications of methodology based
on brief articles from the business press (30 minutes)

Present theory and concepts (30 minutes)

Build a Clementine model for students (30 minutes)

Ask students to build a Clementine model (30 minutes)

Discuss homework assignment (15 minutes)

Students complete a homework assignment after class
(requires three hours)
Discuss Business Applications
“Wal-Mart's next competitive weapon is advanced data
mining, which it will use to forecast, replenish and
merchandise on a micro scale.
By analyzing years' worth of sales data--and then
cranking in variables such as the weather and school
schedules--the system could predict the optimal number
of cases of Gatorade, in what flavors and sizes, a store
in Laredo, Texas, should have on hand the Friday before
Labor Day.
Then, if the weather forecast suddenly called for
temperatures 5 degrees hotter than last year, the delivery truck
would automatically show up with more.”
From: “Can Wal-Mart Get Any Bigger,” Time, 13 January, 2003
Present Theory and Concepts
Are window cleaning products also purchased when
detergents and orange juice are bought together?

Where should detergents be placed
in the store to maximize their sales?

Is soda typically purchased with
bananas? Does the brand of soda
make a difference?

How are the demographics of
the neighborhood affecting what
customers are buying?
From: Data Mining Techniques
by Michael J. A. Berry and Gordon S. Linoff
Present Theory and Concepts

Start with a record of past purchase
transactions that link items purchased together
Purchase Transactions
Customer  Items
1         orange juice, soda
2         milk, orange juice, window cleaner
3         orange juice, detergent
4         orange juice, detergent, soda
5         window cleaner, soda
From: Data Mining Techniques
by Michael J. A. Berry and Gordon S. Linoff
Present Theory and Concepts

Create a co-occurrence matrix that pairs items
purchased together in the form of a table
Co-occurrence Matrix
                OJ  Window Cleaner  Milk  Soda  Detergent
OJ               4               1     1     2          2
Window Cleaner   1               2     1     1          0
Milk             1               1     1     0          0
Soda             2               1     0     3          1
Detergent        2               0     0     1          2
The co-occurrence matrix shows the number of times
the “row” item was purchased with the “column” item
(note that the matrix is symmetrical)
From: Data Mining Techniques
by Michael J. A. Berry and Gordon S. Linoff
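The counting behind a co-occurrence matrix can be sketched in a few lines of Python. This is a minimal illustration using the five purchase transactions from the slides, not Clementine output; the function name `cooccurrence` and the item labels are assumptions made for the example. Counting directly from the five transactions, OJ and detergent appear together twice (customers 3 and 4).

```python
# Five purchase transactions from the slides, one set of items per customer.
transactions = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]
items = ["OJ", "window cleaner", "milk", "soda", "detergent"]


def cooccurrence(transactions, items):
    """Count how many transactions contain both the row and column item.

    The diagonal holds the number of transactions containing the item itself.
    """
    counts = {row: {col: 0 for col in items} for row in items}
    for basket in transactions:
        for row in basket:
            for col in basket:
                counts[row][col] += 1
    return counts


matrix = cooccurrence(transactions, items)
for row in items:
    print(f"{row:15}", [matrix[row][col] for col in items])
```

Because every pair is counted in both directions, the result is symmetrical by construction.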
Present Theory and Concepts
Customer  Items Purchased
1         OJ, soda
2         Milk, OJ, window cleaner
3         OJ, detergent
4         OJ, detergent, soda
5         Window cleaner, soda

Rule Support = Percentage of transactions with both the items
of interest

What is the Support for the rule “If Soda, then OJ” ?



OJ and Soda are purchased together in 2 out of 5 transactions
Hence Support is 40%
What is the support for the rule “If OJ, then Soda” ?

Still 40%
From: Data Mining Techniques
by Michael J. A. Berry and Gordon S. Linoff
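The support calculation can be checked with a short Python sketch. It assumes the five transactions listed on the slide; the helper name `support` is hypothetical and chosen only for this illustration.

```python
# Five purchase transactions from the slides, one set of items per customer.
transactions = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for basket in transactions if wanted <= basket)
    return hits / len(transactions)


# Support ignores the direction of the rule, so both orderings agree.
soda_then_oj = support(transactions, ["soda", "OJ"])
oj_then_soda = support(transactions, ["OJ", "soda"])
print(f"Support: {soda_then_oj:.0%} and {oj_then_soda:.0%}")
```

Support depends only on the set of items in the rule, which is why "If Soda, then OJ" and "If OJ, then Soda" share the same value.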
Present Theory and Concepts
Customer  Items Purchased
1         OJ, soda
2         Milk, OJ, window cleaner
3         OJ, detergent
4         OJ, detergent, soda
5         Window cleaner, soda

Confidence = Ratio of the number of transactions with both the items
of interest to the number of transactions with the “If” items

What is the Confidence for “If Soda, then OJ” ?



2 out of 3 soda purchase transactions also include OJ
Hence Confidence is 66.67%
What is the Confidence for “If OJ, then Soda” ?

2 out of 4 OJ purchase transactions also include soda
 Hence Confidence is 50%
From: Data Mining Techniques
by Michael J. A. Berry and Gordon S. Linoff
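Unlike support, confidence depends on which item is the "If" part of the rule. A minimal Python sketch, again assuming the five slide transactions and a hypothetical helper named `confidence`:

```python
# Five purchase transactions from the slides, one set of items per customer.
transactions = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the "If" items that also
    contain the "then" items."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_if = [basket for basket in transactions if antecedent <= basket]
    if not with_if:
        return 0.0  # the rule never fires, so report zero confidence
    both = sum(1 for basket in with_if if consequent <= basket)
    return both / len(with_if)


soda_then_oj = confidence(transactions, ["soda"], ["OJ"])  # 2 of 3
oj_then_soda = confidence(transactions, ["OJ"], ["soda"])  # 2 of 4
print(f"If Soda, then OJ: {soda_then_oj:.2%}")
print(f"If OJ, then Soda: {oj_then_soda:.2%}")
```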
Present Theory and Concepts

Support (Prevalence): Percentage of records
in the dataset that match the antecedent
Support = p (antecedent)

Antecedent               Probability
OJ                       45%
Soda                     42.5%
Chips                    40%
OJ and Soda              25%
OJ and Chips             20%
Soda and Chips           15%
OJ and Soda and Chips    5%
From: Data Mining Techniques
by Michael J. A. Berry and Gordon S. Linoff
Present Theory and Concepts

Confidence (Predictability): Percentage of records in the
dataset matching the antecedent that also match the
consequent
Confidence = p (antecedent and consequent) / p (antecedent)

Rule                         p(anteced.)  p(anteced. and consequent)  confidence
If OJ and Soda, then Chips   25%          5%                          0.20
If OJ and Chips, then Soda   20%          5%                          0.25
If Soda and Chips, then OJ   15%          5%                          0.33
From: Data Mining Techniques
by Michael J. A. Berry and Gordon S. Linoff
Present Theory and Concepts

Lift (Improvement): How much better is a rule at
predicting the consequent than chance alone?
Lift = confidence / p (consequent)
A rule is only useful if Lift is > 1

Rule                         confidence  p(consequent)  lift
If OJ and Soda then Chips    20%         40.0%          0.50
If OJ and Chips then Soda    25%         42.5%          0.59
If Soda and Chips then OJ    33%         45.0%          0.73
If OJ then Soda              56%         42.5%          1.31
From: Data Mining Techniques
by Michael J. A. Berry and Gordon S. Linoff
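The confidence and lift figures can be reproduced from the probabilities given in the slides. The sketch below is an illustration only; the function name `rule_metrics` is an assumption. It works from unrounded confidences, so the third rule comes out at about 0.74, whereas dividing the rounded confidence of 0.33 by 0.45 gives 0.73.

```python
def rule_metrics(p_antecedent, p_both, p_consequent):
    """Return (confidence, lift) for a rule 'if antecedent then consequent'."""
    confidence = p_both / p_antecedent
    lift = confidence / p_consequent
    return confidence, lift


# (rule, p(antecedent), p(antecedent and consequent), p(consequent)),
# all taken from the probabilities listed in the slides.
rules = [
    ("If OJ and Soda then Chips", 0.25, 0.05, 0.40),
    ("If OJ and Chips then Soda", 0.20, 0.05, 0.425),
    ("If Soda and Chips then OJ", 0.15, 0.05, 0.45),
    ("If OJ then Soda", 0.45, 0.25, 0.425),
]

results = {}
for rule, p_ante, p_both, p_cons in rules:
    conf, lift = rule_metrics(p_ante, p_both, p_cons)
    results[rule] = (conf, lift)
    print(f"{rule:28} confidence={conf:.2f} lift={lift:.2f}")
```

Only "If OJ then Soda" has a lift above 1, so it is the only rule in this set that predicts better than chance.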
Build a Clementine Model
Homework Assignment

Conduct a Market Basket Analysis on the dataset using both the
Apriori and GRI modeling nodes in Clementine.

Reconcile the association rules discovered as a result of the
Apriori and GRI modeling nodes.

Provide a narrative description that attempts to explain the
convergence (or lack thereof) between the results obtained from
the two modeling nodes.

Select those association rules discovered during your Market
Basket Analysis that would make the most intuitive sense to the
category managers involved and create demographic profiles of
shoppers who appear to fit those rules.
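For readers who want to see the idea behind an Apriori-style search outside Clementine, here is a minimal textbook-style sketch in Python. It is not the Clementine Apriori or GRI node and makes no claim about how those nodes are implemented; the function name `frequent_itemsets`, the 40% support threshold, and the reuse of the five example transactions from the earlier slides are all assumptions made for illustration.

```python
from itertools import combinations

# Five example transactions from the earlier slides (not the homework dataset).
transactions = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def frequent_itemsets(transactions, min_support):
    """Level-wise search: keep itemsets whose support meets the threshold,
    then build larger candidates only from itemsets that survived."""
    n = len(transactions)
    all_items = sorted({item for basket in transactions for item in basket})
    candidates = [frozenset([item]) for item in all_items]
    frequent = {}
    size = 1
    while candidates:
        # Keep the candidates that appear in enough transactions.
        level = {}
        for cand in candidates:
            share = sum(1 for basket in transactions if cand <= basket) / n
            if share >= min_support:
                level[cand] = share
        frequent.update(level)
        # Join surviving itemsets into candidates one item larger, and
        # prune any candidate that has an infrequent subset.
        size += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        candidates = [
            cand for cand in joined
            if all(frozenset(sub) in level for sub in combinations(cand, size - 1))
        ]
    return frequent


found = frequent_itemsets(transactions, min_support=0.4)
for itemset, share in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{share:.0%}")
```

Association rules would then be formed from the surviving itemsets and ranked by confidence and lift, as defined in the preceding slides.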
Instructor’s Laptop Screen
Student’s Laptop Screen
Resources

“Data Mining Techniques” by Michael J. A. Berry and
Gordon S. Linoff (second edition), Wiley, 2004

“Discovering Knowledge in Data” by Daniel T. Larose,
Wiley, 2005

“Making Sense of Statistics” by Fred Pyrczak (fourth
edition), Pyrczak Publishing, 2006

Recent articles from the business press identified
using the “Factiva” database with “data mining” and
“predictive analytics” as search keywords

www.kdnuggets.com
Thank you for your time and participation

Questions?

Additional Information: Please see my syllabus at
http://www.spss.com/academic/educator/curriculum/index.htm?tab=1

Comments and suggestions are welcome. Please
send them to: [email protected]