Data Mining Tools
Overview & Tutorial
Ahmed Sameh
Prince Sultan University
Department of Computer Science & Information Systems
May 2010
(Some slides belong to IBM)
Introduction Outline
Goal: Provide an overview of data mining.
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues
Introduction
Data is growing at a phenomenal
rate
Users expect more sophisticated
information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining Definition
Finding hidden information in a
database
Fit data to a model
Similar terms
Exploratory data analysis
Data driven discovery
Deductive learning
Data Mining Algorithm
Objective: Fit Data to a Model
Descriptive
Predictive
Preference – Technique to choose
the best model
Search – Technique to search the
data
“Query”
Database Processing vs. Data Mining Processing
Database query:
– Well defined
– SQL
– Data: operational data
– Output: precise, a subset of the database
Data mining query:
– Poorly defined
– No precise query language
– Data: not operational data
– Output: fuzzy, not a subset of the database
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than
$10,000 in the last month.
– Find all customers who have purchased milk
Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased with
milk. (association rules)
Related Fields
Machine
Learning
Visualization
Data Mining and
Knowledge Discovery
Statistics
Databases
Statistics, Machine Learning
and Data Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics – areas not part
of data mining
Data Mining and Knowledge Discovery
integrates theory and heuristics
focus on the entire process of knowledge discovery,
including data cleaning, learning, and integration and
visualization of results
Distinctions are fuzzy
Definition
A class of database applications that analyze
data in a database using tools that look
for trends or anomalies.
IBM was an early pioneer of data mining tools.
Purpose
To look for hidden patterns or previously
unknown relationships among the data in a
group of data that can be used to predict future
behavior.
Ex: Data mining software can help retail
companies find customers with common
interests.
Background Information
Many of the techniques used by today's data
mining tools have been around for many years,
having originated in the artificial intelligence
research of the 1980s and early 1990s.
Data Mining tools are only now being applied
to large-scale database systems.
The Need for Data Mining
The amount of raw data stored in corporate
data warehouses is growing rapidly.
There is too much data and complexity that
might be relevant to a specific problem.
Data mining promises to bridge the analytical
gap by giving knowledge workers the tools to
navigate this complex analytical space.
The Need for Data Mining, cont’
The need for information has resulted in the
proliferation of data warehouses that integrate
information from multiple sources to support
decision making.
Often include data from external sources, such
as customer demographics and household
information.
Definition (Cont.)
Data mining is the exploration and analysis of large quantities
of data in order to discover valid, novel, potentially useful,
and ultimately understandable patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern
beforehand.
Useful: We can devise actions from the
patterns.
Understandable: We can interpret and
comprehend the patterns.
Of “laws”, Monsters, and Giants…
Moore’s law: processing “capacity” doubles every 18 months: CPU, cache, memory
Its more aggressive cousin: disk storage “capacity” doubles every 9 months
What do the two “laws” combined produce? A rapidly growing gap between our ability to generate data and our ability to analyze it.
[Chart: disk TB shipped per year, 1988–2000, on a log scale from 1E+3 to 1E+7 (1E+6 TB is an exabyte); disk TB growth: 112%/y vs. Moore’s Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]
What is Data Mining?
Finding interesting structure in
data
Structure: refers to statistical patterns,
predictive models, hidden relationships
Examples of tasks addressed by Data Mining
Predictive Modeling (classification,
regression)
Segmentation (Data Clustering )
Summarization
Major Application Areas for
Data Mining Solutions
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
Data Mining
The non-trivial extraction of novel, implicit, and
actionable knowledge from large datasets.
Extremely large datasets
Discovery of the non-obvious
Useful knowledge that can improve processes
Cannot be done manually
Technology to enable data exploration, data analysis,
and data visualization of very large databases at a high
level of abstraction, without a specific hypothesis in
mind.
Sophisticated data search capability that uses statistical
algorithms to discover patterns and correlations in data.
Data Mining (cont.)
Data Mining (cont.)
Data Mining is a step of Knowledge Discovery
in Databases (KDD) Process
Data Warehousing
Data Selection
Data Preprocessing
Data Transformation
Data Mining
Interpretation/Evaluation
Data Mining is sometimes referred to as KDD,
and DM and KDD tend to be used as synonyms
Data Mining Evaluation
Data Mining is Not …
Data warehousing
SQL / Ad Hoc Queries / Reporting
Software Agents
Online Analytical Processing (OLAP)
Data Visualization
Data Mining Motivation
Changes in the Business Environment
Customers becoming more demanding
Markets are saturated
Databases today are huge:
More than 1,000,000 entities/records/rows
From 10 to 10,000 fields/attributes/variables
Gigabytes and terabytes
Databases are growing at an unprecedented
rate
Decisions must be made rapidly
Decisions must be made with maximum
knowledge
Why Use Data Mining Today?
Human analysis skills are inadequate:
Volume and dimensionality of the data
High data growth rate
Availability of:
Data
Storage
Computational power
Off-the-shelf software
Expertise
An Abundance of Data
Supermarket scanners, POS data
Preferred customer cards
Credit card transactions
Direct mail response
Call center records
ATM machines
Demographic data
Sensor networks
Cameras
Web server logs
Customer web site trails
Evolution of Database Technology
1960s: IMS, network model
1970s: The relational data model, first relational
DBMS implementations
1980s: Maturing RDBMS, application-specific
DBMS, (spatial data, scientific data, image data,
etc.), OODBMS
1990s: Mature, high-performance RDBMS
technology, parallel DBMS, terabyte data
warehouses, object-relational DBMS, middleware
and web technology
2000s: High availability, zero-administration,
seamless integration into business processes
2010: Sensor database systems, databases on
embedded systems, P2P database systems,
large-scale pub/sub systems, ???
Much Commercial Support
Many data mining tools
http://www.kdnuggets.com/software
Database systems with data mining
support
Visualization tools
Data mining process support
Consultants
Why Use Data Mining Today?
Competitive pressure!
“The secret of success is to know something that
nobody else knows.”
Aristotle Onassis
Competition on service, not only on price (Banks,
phone companies, hotel chains, rental car
companies)
Personalization, CRM
The real-time enterprise
“Systemic listening”
Security, homeland defense
The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into
business processes
Data Mining Step in Detail
2.1 Data preprocessing
Data selection: Identify target
datasets and relevant fields
Data cleaning
Remove noise and outliers
Data transformation
Create common units
Generate new fields
2.2 Data mining model construction
2.3 Model evaluation
Preprocessing and Mining
[Flow diagram: Original Data → (Data Integration and Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Model Construction) → Patterns → (Interpretation) → Knowledge]
Data Mining Techniques
Descriptive: Clustering, Association, Sequential Analysis
Predictive: Classification, Decision Tree, Rule Induction, Neural Networks, Nearest Neighbor Classification, Regression
Data Mining Models and Tasks
Basic Data Mining Tasks
Classification maps data into
predefined groups or classes
Supervised learning
Pattern recognition
Prediction
Regression is used to map a data item
to a real valued prediction variable.
Clustering groups similar data
together into clusters.
Unsupervised learning
Segmentation
Partitioning
Basic Data Mining Tasks (cont’d)
Summarization maps data into subsets
with associated simple descriptions.
Characterization
Generalization
Link Analysis uncovers relationships
among data.
Affinity Analysis
Association Rules
Sequential Analysis determines sequential
patterns.
Ex: Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior
Data Mining vs. KDD
Knowledge Discovery in
Databases (KDD): process of
finding useful information and
patterns in data.
Data Mining: Use of algorithms to
extract the information and patterns
derived by the KDD process.
Data Mining Development
•Similarity Measures
•Relational Data Model
•SQL
•Association Rule Algorithms
•Data Warehousing
•Scalability Techniques
•Hierarchical Clustering
•IR Systems
•Imprecise Queries
•Textual Data
•Web Search Engines
•Bayes Theorem
•Regression Analysis
•EM Algorithm
•K-Means Clustering
•Time Series Analysis
•Algorithm Design Techniques
•Algorithm Analysis
•Data Structures
•Neural Networks
•Decision Tree Algorithms
KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
KDD Issues (cont’d)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
Visualization Techniques
Graphical
Geometric
Icon-based
Pixel-based
Hierarchical
Hybrid
Data Mining Applications
Data Mining Applications:
Retail
Performing basket analysis
Which items customers tend to purchase together. This
knowledge can improve stocking, store layout
strategies, and promotions.
Sales forecasting
Examining time-based patterns helps retailers make
stocking decisions. If a customer purchases an item
today, when are they likely to purchase a
complementary item?
Database marketing
Retailers can develop profiles of customers with certain
behaviors, for example, those who purchase designer-label
clothing or those who attend sales. This
information can be used to focus cost-effective
promotions.
Merchandise planning and allocation
When retailers add new stores, they can improve
merchandise planning and allocation by examining
patterns in stores with similar demographic characteristics.
Data Mining Applications:
Banking
Card marketing
By identifying customer segments, card issuers and
acquirers can improve profitability with more effective
acquisition and retention programs, targeted product
development, and customized pricing.
Cardholder pricing and profitability
Card issuers can take advantage of data mining
technology to price their products so as to maximize
profit and minimize loss of customers. Includes risk-based pricing.
Fraud detection
Fraud is enormously costly. By analyzing past
transactions that were later determined to be
fraudulent, banks can identify patterns.
Predictive life-cycle management
DM helps banks predict each customer’s lifetime value
and to service each segment appropriately (for example,
offering special deals and discounts).
Data Mining Applications:
Telecommunication
Call detail record analysis
Telecommunication companies accumulate detailed
call records. By identifying customer segments with
similar use patterns, the companies can develop
attractive pricing and feature promotions.
Customer loyalty
Some customers repeatedly switch providers, or
“churn”, to take advantage of attractive incentives
by competing companies. The companies can use
DM to identify the characteristics of customers who
are likely to remain loyal once they switch, thus
enabling the companies to target their spending on
customers who will produce the most profit.
Data Mining Applications:
Other Applications
Customer segmentation
All industries can take advantage of DM to discover
discrete segments in their customer bases by
considering additional variables beyond traditional
analysis.
Manufacturing
Through choice boards, manufacturers are beginning to
customize products for customers; therefore they must
be able to predict which features should be bundled to
meet customer demand.
Warranties
Manufacturers need to predict the number of customers
who will submit warranty claims and the average cost of
those claims.
Frequent flier incentives
Airlines can identify groups of customers that can be
given incentives to fly more.
A producer wants to know….
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
What is the most effective distribution channel?
What product promotions have the biggest impact on revenue?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
Data, Data everywhere
yet ...
I can’t find the data I need
data is scattered over the
network
many versions, subtle
differences
I can’t get the data I need
need an expert to get the data
I can’t understand the data I
found
available data poorly documented
I can’t use the data I found
results are unexpected
data needs to be transformed from one form to another
What is a Data Warehouse?
A single, complete and consistent store of data
obtained from a variety of different sources,
made available to end users in a way they
can understand and use in a business context.
[Barry Devlin]
What are the users saying...
Data should be integrated
across the enterprise
Summary data has a real
value to the organization
Historical data holds the
key to understanding data
over time
What-if capabilities are
required
What is Data Warehousing?
Information
Data
A process of
transforming data into
information and
making it available to
users in a timely
enough manner to
make a difference
[Forrester Research, April
1996]
Very Large Data Bases
Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: weather images
Yottabytes -- 10^24 bytes: intelligence agency videos
Data Warehousing – It is a Process
Technique for assembling and managing data
from various sources for the purpose of
answering business questions, thus making
decisions that were not previously possible
A decision support database maintained
separately from the organization’s operational
database
Data Warehouse
A data warehouse is a
subject-oriented
integrated
time-varying
non-volatile
collection of data that is used primarily in
organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
Data Warehousing Concepts
Decision support is key for companies wanting
to turn their organizational data into an
information asset
Traditional database is transaction-oriented
while data warehouse is data-retrieval
optimized for decision-support
Data Warehouse
"A subject-oriented, integrated, time-variant,
and non-volatile collection of data in support of
management's decision-making process"
OLAP (on-line analytical processing), Decision
Support Systems (DSS), Executive Information
Systems (EIS), and data mining applications
What does data warehouse do?
integrate diverse information from
various systems which enable users to
quickly produce powerful ad-hoc queries
and perform complex analysis
create an infrastructure for reusing the
data in numerous ways
create an open systems environment to
make useful information easily accessible
to authorized users
help managers make informed decisions
Benefits of Data Warehousing
Potential high returns on investment
Competitive advantage
Increased productivity of corporate
decision-makers
Comparison of OLTP and Data Warehousing
OLTP systems:
Holds current data
Stores detailed data
Data is dynamic
Repetitive processing
High level of transaction throughput
Predictable pattern of usage
Transaction driven
Application oriented
Supports day-to-day decisions
Serves large number of clerical / operational users
Data warehousing systems:
Holds historic data
Stores detailed, lightly, and highly summarized data
Data is largely static
Ad hoc, unstructured, and heuristic processing
Medium to low transaction throughput
Unpredictable pattern of usage
Analysis driven
Subject oriented
Supports strategic decisions
Serves relatively lower number of managerial users
Data Warehouse Architecture
Operational Data
Load Manager
Warehouse Manager
Query Manager
Detailed Data
Lightly and Highly Summarized Data
Archive / Backup Data
Meta-Data
End-user Access Tools
End-user Access Tools
Reporting and query tools
Application development tools
Executive Information System (EIS)
tools
Online Analytical Processing (OLAP)
tools
Data mining tools
Data Warehousing Tools and Technologies
Extraction, Cleansing, and Transformation
Tools
Data Warehouse DBMS
Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Networked data warehouse
Warehouse administration
Integrated dimensional tools
Advanced query functionality
Data Marts
A subset of data warehouse that
supports the requirements of a
particular department or business
function
Online Analytical Processing (OLAP)
OLAP: the dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data
Multi-dimensional OLAP: cubes of data
[Cube diagram with dimensions such as Product type, City, and Time]
Problems of Data Warehousing
Underestimation of resources for
data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration
Codd's Rules for OLAP
Multi-dimensional conceptual view
Transparency
Accessibility
Consistent reporting performance
Client-server architecture
Generic dimensionality
Dynamic sparse matrix handling
Multi-user support
Unrestricted cross-dimensional operations
Intuitive data manipulation
Flexible reporting
Unlimited dimensions and aggregation levels
OLAP Tools
Multi-dimensional OLAP (MOLAP)
Multi-dimensional DBMS (MDDBMS)
Relational OLAP (ROLAP)
Creation of multiple multi-dimensional
views of the two-dimensional relations
Managed Query Environment (MQE)
Deliver selected data directly from the
DBMS to the desktop in the form of a
data cube, where it is stored, analyzed,
and manipulated locally
Data Mining
Definition
The process of extracting valid, previously
unknown, comprehensible, and actionable
information from large databases and using
it to make crucial business decisions
Knowledge discovery
Association rules
Sequential patterns
Classification trees
Goals
Prediction
Identification
Classification
Optimization
Data Mining Techniques
Predictive Modeling
Supervised training with two phases
Training phase : building a model using
large sample of historical data called
the training set
Testing phase : trying the model on
new data
Database Segmentation
Link Analysis
Deviation Detection
What are Data Mining Tasks?
Classification
Regression
Clustering
Summarization
Dependency modeling
Change and Deviation Detection
What are Data Mining Discoveries?
New Purchase Trends
Plan Investment Strategies
Detect Unauthorized Expenditure
Fraudulent Activities
Crime Trends
Smugglers’ border crossings
Data Warehouse Architecture
[Diagram: sources (relational databases, ERP systems, purchased data, legacy data) feed an extraction and cleansing stage, then an optimized loader into the data warehouse engine; a metadata repository describes the warehouse, which serves analyze/query tools]
Data Warehouse for Decision
Support & OLAP
Putting information technology to work to help the
knowledge worker make faster and better
decisions
Which of my customers are most likely to go
to the competition?
What product promotions have the biggest
impact on revenue?
How did the share price of software
companies correlate with profits over the last 10
years?
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and
can be ad-hoc
Used by managers and end-users to
understand the business and make
judgements
Data Mining works with Warehouse
Data
Data Warehousing
provides the Enterprise
with a memory
Data Mining provides
the Enterprise with
intelligence
We want to know ...
Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?
Which types of transactions are likely to be fraudulent
given the demographics and transactional history of a
particular customer?
If I raise the price of my product by Rs. 2, what is the
effect on my ROI?
If I offer only 2,500 airline miles as an incentive to
purchase rather than 5,000, how many lost responses will
result?
If I emphasize ease-of-use of the product as opposed to its
technical capabilities, what will be the net effect on my
revenues?
Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
Application Areas
Industry               | Application
Finance                | Credit Card Analysis
Insurance              | Claims, Fraud Analysis
Telecommunication      | Call record analysis
Transport              | Logistics management
Consumer goods         | Promotion analysis
Data Service providers | Value added data
Utilities              | Power usage analysis
Data Mining in Use
The US Government uses Data Mining to
track fraud
A Supermarket becomes an information
broker
Basketball teams use it to track game
strategy
Cross Selling
Warranty Claims Routing
Holding on to Good Customers
Weeding out Bad Customers
What makes data mining possible?
Advances in the following areas are
making data mining deployable:
data warehousing
better and more data (i.e., operational,
behavioral, and demographic)
the emergence of easily deployed data
mining tools and
the advent of new data mining
techniques.
-- Gartner Group
Why Separate Data Warehouse?
Performance
Op dbs designed & tuned for known txs & workloads.
Complex OLAP queries would degrade perf. for op txs.
Special data organization, access & implementation
methods needed for multidimensional views & queries.
Function
Missing data: Decision support requires historical data, which
op dbs do not typically maintain.
Data consolidation: Decision support requires consolidation
(aggregation, summarization) of data from many
heterogeneous sources: op dbs, external sources.
Data quality: Different sources typically use inconsistent data
representations, codes, and formats which have to be
reconciled.
What are Operational Systems?
They are OLTP systems
Run mission critical
applications
Need to work with
stringent performance
requirements for
routine tasks
Used to run a
business!
RDBMS used for OLTP
Database Systems have been used
traditionally for OLTP
clerical data processing tasks
detailed, up to date data
structured repetitive tasks
read/update a few records
isolation, recovery and integrity are
critical
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large
numbers of simple read/write
transactions
Optimized for fast response to
predefined transactions
Used by people who deal with
customers, products -- clerks,
salespeople etc.
They are increasingly used by
customers
Examples of Operational Data
Data               | Industry           | Usage                        | Technology                                              | Volumes
Customer File      | All                | Track customer details       | Legacy applications, flat files, mainframes             | Small-medium
Account Balance    | Finance            | Control account activities   | Legacy applications, hierarchical databases, mainframe  | Large
Point-of-Sale data | Retail             | Generate bills, manage stock | ERP, Client/Server, relational databases                | Very Large
Call Record        | Telecommunications | Billing                      | Legacy applications, hierarchical databases, mainframe  | Very Large
Production Record  | Manufacturing      | Control production           | ERP, relational databases, AS/400                       | Medium
Application-Orientation vs. Subject-Orientation
Operational Database (application-oriented): Loans, Credit Card, Trust, Savings
Data Warehouse (subject-oriented): Customer, Vendor, Product, Activity
OLTP vs. Data Warehouse
OLTP systems are tuned for known
transactions and workloads while
workload is not known a priori in a data
warehouse
Special data organization, access methods
and implementation methods are needed
to support data warehouse queries
(typically multidimensional queries)
e.g., average amount spent on phone calls
between 9AM-5PM in Pune during the month
of December
OLTP vs Data Warehouse
OLTP:
Application oriented
Used to run business
Detailed data
Current, up to date
Isolated data
Repetitive access
Clerical user
Warehouse (DSS):
Subject oriented
Used to analyze business
Summarized and refined
Snapshot data
Integrated data
Ad-hoc access
Knowledge user (manager)
OLTP vs Data Warehouse
OLTP:
Performance sensitive
Few records accessed at a time (tens)
Read/update access
No data redundancy
Database size: 100 MB to 100 GB
Data Warehouse:
Performance relaxed
Large volumes accessed at a time (millions)
Mostly read (batch update)
Redundancy present
Database size: 100 GB to a few terabytes
OLTP vs Data Warehouse
OLTP:
Transaction throughput is the performance metric
Thousands of users
Managed in entirety
Data Warehouse:
Query throughput is the performance metric
Hundreds of users
Managed by subsets
To summarize ...
OLTP Systems are
used to “run” a
business
The Data
Warehouse helps
to “optimize” the
business
Why Now?
Data is being produced
ERP provides clean data
The computing power is available
The computing power is affordable
The competitive pressures are
strong
Commercial products are available
Myths surrounding OLAP Servers
and Data Marts
Data marts and OLAP servers are departmental
solutions supporting a handful of users
Million dollar massively parallel hardware is
needed to deliver fast response times for complex queries
OLAP servers require massive and unwieldy
indices
Complex OLAP queries clog the network with
data
Data warehouses must be at least 100 GB to be
effective
– Source -- Arbor Software Home Page
II. On-Line Analytical Processing (OLAP)
Making Decision
Support Possible
Typical OLAP Queries
Write a multi-table join to compare sales for each
product line YTD this year vs. last year.
Repeat the above process to find the top 5
product contributors to margin.
Repeat the above process to find the sales of a
product line to new vs. existing customers.
Repeat the above process to find the customers
that have had negative sales growth.
What Is OLAP?
Online Analytical Processing - coined by
E. F. Codd in a 1994 paper commissioned by
Arbor Software*
Generally synonymous with earlier terms such as
Decisions Support, Business Intelligence, Executive
Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase,
Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
The OLAP Market
Rapid growth in the enterprise market
1995: $700 Million
1997: $2.1 Billion
Significant consolidation activity among
major DBMS vendors
10/94: Sybase acquires ExpressWay
7/95: Oracle acquires Express
11/95: Informix acquires Metacube
1/97: Arbor partners up with IBM
10/96: Microsoft acquires Panorama
Result: OLAP shifted from small vertical
niche to mainstream DBMS category
Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive response
times
It is good for analyzing time series
It can be useful to find some clusters and
outliers
Many vendors offer OLAP tools
OLAP Is FASMI
Fast
Analysis
Shared
Multidimensional
Information
Nigel Pendse, Richard Creeth - The OLAP Report
Multi-dimensional Data
“Hey…I sold $100M worth of goods”
Dimensions: Product, Region, Time
Hierarchical summarization paths:
Product: Industry → Category → Product
Region: Country → Region → City → Office
Time: Year → Quarter → Month → Day, with a parallel Week → Day path
[Cube diagram: products (Juice, Cola, Milk, Cream, Toothpaste, Soap) × regions (N, S, W) × time periods 1–7]
A Visual Operation: Pivot (Rotate)
[Example: a 2-D view of sales with Product on one axis (Juice 10, Cola 47, Milk 30, Cream 12) and Date (3/1–3/4) on the other; pivoting rotates the view to swap the axes]
“Slicing and Dicing”
The Telecomm Slice
[Diagram: cube with dimensions Product (Household, Telecomm, Video, Audio), Region (Europe, Far East, India), and Sales Channel (Retail, Direct, Special); slicing fixes Product = Telecomm]
Roll-up and Drill Down
Sales Channel example, from higher level of aggregation to low-level details:
Region → Country → State → Location Address → Sales Representative
Roll-up moves up this hierarchy; drill-down moves toward the low-level details
Results of Data Mining Include:
Forecasting what may happen in the
future
Classifying people or things into
groups by recognizing patterns
Clustering people or things into
groups based on their attributes
Associating what events are likely to
occur together
Sequencing what events are likely to
lead to later events
Data mining is not
Brute-force crunching of
bulk data
“Blind” application of
algorithms
Going to find relationships
where none exist
Presenting data in different
ways
A database intensive task
A difficult to understand
technology requiring an
advanced degree in
computer science
Data Mining versus OLAP
OLAP (On-line Analytical Processing) provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening
Data Mining Versus Statistical Analysis
Statistical Data Analysis:
Tests for statistical correctness of models
Are statistical assumptions of models correct? E.g. is the R-Square good?
Hypothesis testing: is the relationship significant? Use a t-test to validate significance
Tends to rely on sampling
Techniques are not optimised for large amounts of data
Requires strong statistical skills
Data Mining:
Originally developed to act as expert systems to solve problems
Less interested in the mechanics of the technique: if it makes sense then let’s use it
Does not require assumptions to be made about data
Can find patterns in very large amounts of data
Requires understanding of data and business problem
Examples of What People are Doing with Data Mining:
Fraud/Non-Compliance:
Anomaly detection
Isolate the factors that lead to fraud, waste and abuse
Target auditing and investigative efforts more effectively
Credit/Risk Scoring
Intrusion detection
Parts failure prediction
Recruiting/Attracting customers
Maximizing profitability (cross selling, identifying profitable customers)
Service Delivery and Customer Retention:
Build profiles of customers likely to use which services
Web Mining
What data mining has done for...
The US Internal Revenue Service
needed to improve customer
service and...
Scheduled its workforce
to provide faster, more accurate
answers to questions.
What data mining has done for...
The US Drug Enforcement
Agency needed to be more
effective in their drug “busts”
and
analyzed suspects’ cell phone
usage to focus investigations.
What data mining has done for...
HSBC needed to cross-sell more
effectively by identifying profiles
that would be interested in higher
yielding investments and...
Reduced direct mail costs by 30%
while garnering 95% of the
campaign’s revenue.
Suggestion: Predicting Washington
C-Span has launched a digital
archive of 500,000 hours of audio
debates.
Text mining or audio mining of these
talks could reveal certain questions
such as….
Example Application: Sports
IBM Advanced Scout analyzes
NBA game statistics
Shots blocked
Assists
Fouls
Google: “IBM Advanced Scout”
Advanced Scout
Example pattern: An analysis of the
data from a game played between
the New York Knicks and the Charlotte
Hornets revealed that “When Glenn Rice
played the shooting guard position, he
shot 5/6 (83%) on jump shots."
Pattern is interesting:
The average shooting percentage for the
Charlotte Hornets during that game was
54%.
Data Mining: Types of Data
Relational data and transactional data
Spatial and temporal data, spatiotemporal observations
Time-series data
Text
Images, video
Mixtures of data
Sequence data
Features from processing other data
sources
Data Mining Techniques
Supervised learning
Classification and regression
Unsupervised learning
Clustering
Dependency modeling
Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection
Different Types of Classifiers
Linear discriminant analysis (LDA)
Quadratic discriminant analysis
(QDA)
Density estimation methods
Nearest neighbor methods
Logistic regression
Neural networks
Fuzzy set theory
Decision Trees
Test Sample Estimate
Divide D into D1 and D2
Use D1 to construct the classifier d
Then use resubstitution estimate
R(d,D2) to calculate the estimated
misclassification error of d
Unbiased and efficient, but removes
D2 from training dataset D
V-fold Cross Validation
Procedure:
Construct classifier d from D
Partition D into V datasets D1, …, DV
Construct classifier di using D \ Di
Calculate the estimated misclassification
error R(di,Di) of di using test sample Di
Final misclassification estimate:
Weighted combination of individual
misclassification errors:
R(d,D) = 1/V Σ R(di,Di)
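To make the estimate concrete, here is a minimal Python sketch of V-fold cross-validation (not from the original slides); the names `train` and `classify` are hypothetical stand-ins for any learning algorithm and its prediction function:

```python
import random

def cross_validation_error(data, train, classify, V=10):
    """Estimate misclassification error by V-fold cross-validation.

    data     -- list of (features, label) pairs (the dataset D); assumes len(data) >= V
    train    -- hypothetical function: list of examples -> model
    classify -- hypothetical function: (model, features) -> predicted label
    """
    data = data[:]                            # work on a shuffled copy
    random.shuffle(data)
    folds = [data[i::V] for i in range(V)]    # partition D into D1..DV
    total_error = 0.0
    for i, Di in enumerate(folds):
        # Construct classifier di from D \ Di
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        di = train(training)
        # Resubstitution estimate R(di, Di) on the held-out fold
        errors = sum(1 for xs, y in Di if classify(di, xs) != y)
        total_error += errors / len(Di)
    return total_error / V                    # R(d,D) = 1/V * sum R(di,Di)
```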
Cross-Validation: Example
[Diagram: classifier d built from all of D; classifiers d1, d2, d3 each built from D with one fold held out]
Cross-Validation
Misclassification estimate obtained
through cross-validation is usually
nearly unbiased
Costly computation (we need to
compute d, and d1, …, dV);
computation of di is nearly as
expensive as computation of d
Preferred method to estimate quality
of learning algorithms in the
machine learning literature
Decision Tree Construction
Three algorithmic components:
Split selection (CART, C4.5, QUEST, CHAID, CRUISE, …)
Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)
Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)
Goodness of a Split
Consider node t with impurity phi(t)
The reduction in impurity through splitting predicate s (t splits into children nodes tL with impurity phi(tL) and tR with impurity phi(tR)) is:
Δphi(s,t) = phi(t) – pL · phi(tL) – pR · phi(tR)
where pL and pR are the fractions of records sent to tL and tR
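A small illustrative Python sketch of this computation, using the Gini index as one possible choice for the impurity function phi (the slides do not fix a particular phi):

```python
def gini(labels):
    """Gini impurity phi(t) of the class labels reaching node t."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_reduction(labels, left, right):
    """Delta phi(s,t) = phi(t) - pL*phi(tL) - pR*phi(tR) for a split s
    that partitions `labels` into the children `left` and `right`."""
    pL = len(left) / len(labels)
    pR = len(right) / len(labels)
    return gini(labels) - pL * gini(left) - pR * gini(right)

# Example: a pure split of a 50/50 node gives the maximal reduction 0.5
print(impurity_reduction(['a'] * 5 + ['b'] * 5, ['a'] * 5, ['b'] * 5))
```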
Pruning Methods
Test dataset pruning
Direct stopping rule
Cost-complexity pruning
MDL pruning
Pruning by randomization testing
Stopping Policies
A stopping policy indicates when further
growth of the tree at a node t is
counterproductive.
All records are of the same class
The attribute values of all records are
identical
All records have missing values
At most one class has a number of
records larger than a user-specified
number
All records go to the same child node if t
is split (only possible with some split selection methods)
Test Dataset Pruning
Use an independent test sample D’
to estimate the misclassification cost
using the resubstitution estimate
R(T,D’) at each node
Select the subtree T’ of T with the
smallest expected cost
Missing Values
What is the problem?
During computation of the splitting
predicate, we can selectively ignore
records with missing values (note that
this has some problems)
But if a record r misses the value of the
splitting attribute, r cannot
participate further in tree
construction
Algorithms for missing values address
this problem.
Mean and Mode Imputation
Assume record r has missing value
r.X, and splitting variable is X.
Simplest algorithm:
If X is numerical (categorical), impute
the overall mean (mode)
Improved algorithm:
If X is numerical (categorical), impute
the mean(X|t.C) (the mode(X|t.C))
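A minimal Python sketch of both variants; the record format (a list of dicts) and the helper names are illustrative assumptions, and `by_class` enables the class-conditional mean(X|t.C)/mode(X|t.C) version:

```python
from statistics import mean, mode

def impute(records, attr, numerical=True, by_class=None):
    """Fill missing values (None) of attribute `attr` in place.

    numerical -- impute the mean if True, else the mode
    by_class  -- optional class attribute; if given, impute the
                 class-conditional mean/mode (the improved algorithm)
    """
    def fill(subset):
        known = [r[attr] for r in subset if r[attr] is not None]
        if not known:
            return                              # nothing to impute from
        value = mean(known) if numerical else mode(known)
        for r in subset:
            if r[attr] is None:
                r[attr] = value

    if by_class is None:
        fill(records)                           # overall mean/mode
    else:
        for c in {r[by_class] for r in records}:
            fill([r for r in records if r[by_class] == c])
```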
Decision Trees: Summary
Many applications of decision trees
There are many algorithms available for:
Split selection
Pruning
Handling Missing Values
Data Access
Decision tree construction still active
research area (after 20+ years!)
Challenges: Performance, scalability,
evolving datasets, new applications
Supervised vs. Unsupervised Learning
Supervised
y=F(x): true function
D: labeled training set
D: {xi,F(xi)}
Learn:
G(x): model trained to
predict labels D
Goal:
E[(F(x)-G(x))2] ≈ 0
Well defined criteria:
Accuracy, RMSE, ...
Unsupervised
Generator: true model
D: unlabeled data
sample
D: {xi}
Learn
??????????
Goal:
??????????
Well defined criteria:
??????????
Clustering: Unsupervised Learning
Given:
Data Set D (training set)
Similarity/distance metric/information
Find:
Partitioning of data
Groups of similar/close items
Similarity?
Groups of similar customers
Similar demographics
Similar buying behavior
Similar health
Similar products
Similar cost
Similar function
Similar store
…
Similarity usually is domain/problem
specific
Clustering: Informal Problem
Definition
Input:
A data set of N records, each given as a d-dimensional data feature vector.
Output:
Determine a natural, useful “partitioning”
of the data set into a number of (k)
clusters and noise such that we have:
High similarity of records within each cluster
(intra-cluster similarity)
Low similarity of records between clusters
(inter-cluster similarity)
Types of Clustering
Hard Clustering:
Each object is in one and only one
cluster
Soft Clustering:
Each object has a probability of being
in each cluster
Clustering Algorithms
Partitioning-based clustering
K-means clustering
K-medoids clustering
EM (expectation maximization) clustering
Hierarchical clustering
Divisive clustering (top down)
Agglomerative clustering (bottom up)
Density-Based Methods
Regions of dense points separated by sparser
regions of relatively low density
K-Means Clustering Algorithm
Initialize k cluster centers
Do
Assignment step: Assign each data point to its closest
cluster center
Re-estimation step: Re-compute cluster centers
While (there are still changes in the cluster centers)
Visualization at:
http://www.delftcluster.nl/textminer/theory/kmeans/kmeans.html
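The loop above translates almost directly into code; below is a minimal Python sketch for points given as coordinate tuples, using random initial centers as one possible starting strategy (math.dist requires Python 3.8+):

```python
import math
import random

def kmeans(points, k, max_iters=100):
    """points: list of coordinate tuples; returns (centers, assignment)."""
    centers = random.sample(points, k)          # initialize k cluster centers
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest cluster center
        assignment = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        # Re-estimation step: re-compute each center as the mean of its points
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members))
                               if members else centers[j])
        if new_centers == centers:              # no change -> terminated
            break
        centers = new_centers
    return centers, assignment
```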
Issues
Why does K-Means work:
How does it find the cluster centers?
Does it find an optimal clustering?
What are good starting points for the algorithm?
What is the right number of cluster centers?
How do we know it will terminate?
Agglomerative Clustering
Algorithm:
Put each item in its own cluster (all singletons)
Find all pairwise distances between clusters
Merge the two closest clusters
Repeat until everything is in one cluster
Observations:
Results in a hierarchical clustering
Yields a clustering for each possible number of
clusters
Greedy clustering: Result is not “optimal” for any
cluster size
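A naive Python sketch of this procedure, using single-link (minimum pairwise) distance as one common choice; it is cubic-time and meant only to illustrate the algorithm:

```python
import math

def agglomerative(points):
    """Single-link agglomerative clustering of coordinate tuples.

    Returns the list of merges (cluster_a, cluster_b) in the order they
    happen, which encodes the hierarchy (a clustering for each possible k).
    """
    clusters = [[p] for p in points]             # every item is a singleton
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters (single link: min pairwise distance)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest
        del clusters[j]
    return merges
```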
Density-Based Clustering
A cluster is defined as a connected dense
component.
Density is defined in terms of number of
neighbors of a point.
We can find clusters of arbitrary shape
Market Basket Analysis
Consider shopping cart filled with
several items
Market basket analysis tries to
answer the following questions:
Who makes purchases?
What do customers buy together?
In what order do customers purchase
items?
Market Basket Analysis
Given: a database of customer transactions, where each transaction is a set of items
Example: transaction with TID 111 contains items {Pen, Ink, Milk, Juice}
TID | CID | Date   | Item  | Qty
111 | 201 | 5/1/99 | Pen   | 2
111 | 201 | 5/1/99 | Ink   | 1
111 | 201 | 5/1/99 | Milk  | 3
111 | 201 | 5/1/99 | Juice | 6
112 | 105 | 6/3/99 | Pen   | 1
112 | 105 | 6/3/99 | Ink   | 1
112 | 105 | 6/3/99 | Milk  | 1
113 | 106 | 6/5/99 | Pen   | 1
113 | 106 | 6/5/99 | Milk  | 1
114 | 201 | 7/1/99 | Pen   | 2
114 | 201 | 7/1/99 | Ink   | 2
114 | 201 | 7/1/99 | Juice | 4
Market Basket Analysis (Contd.)
Co-occurrences
80% of all customers purchase items X,
Y and Z together.
Association rules
60% of all customers who purchase X
and Y also buy Z.
Sequential patterns
60% of customers who first buy X also
purchase Y within three weeks.
Confidence and Support
We prune the set of all possible association rules using two interestingness measures:
Confidence of a rule: X → Y has confidence c if P(Y|X) = c
Support of a rule: X → Y has support s if P(XY) = s
We can also define the support of an itemset (a co-occurrence) XY: P(XY)
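These two measures translate directly into code; the sketch below uses the transactions from the market basket table above, treating each transaction as a set of items and ignoring quantities:

```python
def support(itemset, transactions):
    """P(itemset): fraction of transactions containing all its items."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """P(rhs | lhs) = support(lhs union rhs) / support(lhs)."""
    return (support(set(lhs) | set(rhs), transactions)
            / support(lhs, transactions))

# Transactions from the market basket table above
T = [{"Pen", "Ink", "Milk", "Juice"},   # TID 111
     {"Pen", "Ink", "Milk"},            # TID 112
     {"Pen", "Milk"},                   # TID 113
     {"Pen", "Ink", "Juice"}]           # TID 114

print(support({"Pen", "Ink"}, T))         # 0.75 (3 of 4 transactions)
print(confidence({"Ink"}, {"Juice"}, T))  # 2/3
```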
Market Basket Analysis:
Applications
Sample Applications
Direct marketing
Fraud detection for medical insurance
Floor/shelf planning
Web site layout
Cross-selling
Applications of Frequent Itemsets
Market Basket Analysis
Association Rules
Classification (especially: text, rare
classes)
Seeds for construction of Bayesian
Networks
Web log analysis
Collaborative filtering
Association Rule Algorithms
More abstract problem redux
Breadth-first search
Depth-first search
Problem Redux
Abstract:
A set of items {1,2,…,k}
A database of transactions
(itemsets) D={T1, T2, …,
Tn},
Tj subset {1,2,…,k}
GOAL:
Find all itemsets that appear in
at least x transactions
(“appear in” == “are subsets
of”)
I subset T: T supports I
For an itemset I, the number of
transactions it appears in is
called the support of I.
Concrete:
I = {milk, bread, cheese,
…}
D={
{milk,bread,cheese},
{bread,cheese,juice}, …}
GOAL:
Find all itemsets that appear
in at least 1000
transactions
{milk,bread,cheese}
supports {milk,bread}
Problem Redux (Contd.)
Definitions:
An itemset is frequent if it
is a subset of at least x
transactions. (FI.)
An itemset is maximally
frequent if it is frequent
and it does not have a
frequent superset. (MFI.)
GOAL: Given x, find all
frequent (maximally
frequent) itemsets (to be
stored in the FI (MFI)).
Obvious relationship:
MFI subset FI
Example:
D={ {1,2,3}, {1,2,3},
{1,2,3}, {1,2,4} }
Minimum support x = 3
{1,2} is frequent
{1,2,3} is maximal frequent
Support({1,2}) = 4
All maximal frequent
itemsets: {1,2,3}
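A compact Python sketch of this level-wise (breadth-first) search for the FI; full Apriori adds subset-based candidate pruning, which is omitted here for brevity:

```python
from itertools import combinations

def frequent_itemsets(D, x):
    """Level-wise search for all itemsets that are subsets of at least
    x transactions in D. Returns a dict {itemset: support}."""
    D = [frozenset(t) for t in D]
    items = {i for t in D for i in t}
    level = [frozenset([i]) for i in items]       # candidate 1-itemsets
    FI = {}
    while level:
        # Count support of each candidate; keep the frequent ones
        frequent = {}
        for cand in level:
            s = sum(cand <= t for t in D)
            if s >= x:
                frequent[cand] = s
        FI.update(frequent)
        # Next level: unions of frequent itemsets that are one item larger
        keys = list(frequent)
        size = len(keys[0]) + 1 if keys else 0
        level = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
    return FI

# The example above, with minimum support x = 3
D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
print(frequent_itemsets(D, 3))  # includes {1,2}: 4 and {1,2,3}: 3
```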
Applications
Spatial association rules
Web mining
Market basket analysis
User/customer profiling
Suggested Extensions: Sequential
Patterns
In the “Market Basket Analysis” example,
replace Milk, Pen, etc. with names of
medications and use the idea in a new
hospital data mining proposal
Take the idea of swarm intelligence and add
to it the extra analysis of the
induction rules in this set of slides
Kraft Foods: Direct Marketing
Company maintains a large database of purchases by customers.
Data mining
1. Analysts identified associations among groups of products
bought by particular segments of customers.
2. Sent out 3 sets of coupons to various households.
• Better response rates: 50% increase in sales for one of its
products
• Continues to use this approach
Health Insurance Commission of Australia: Insurance Fraud
Commission maintains a database of insurance claims, including
laboratory tests ordered during the diagnosis of patients.
Data mining
1. Identified the practice of "up coding" to reflect more
expensive tests than are necessary.
2. Now monitors orders for lab tests.
• Commission expects to save US$1,000,000 / year by
eliminating the practice of "up coding”.
HNC Software: Credit Card Fraud
Payment Fraud
Large issuers of cards may lose
$10 million / year due to fraud
Difficult to identify the few transactions among thousands which
reflect potential fraud
Falcon software
Mines data through neural networks
Introduced in September 1992
Models each cardholder's requested transaction against the customer's
past spending history.
processes several hundred requests per second
compares current transaction with customer's history
identifies the transactions most likely to be frauds
enables bank to stop high-risk transactions before they are
authorized
Used by many retail banks: currently monitors
160 million card accounts for fraud
New Account Fraud
Fraudulent applications for credit cards are growing at 50 %
per year
Falcon Sentry software
Mines data through neural networks and a rule base
Introduced in September 1992
Checks information on applications against data from
credit bureaus
Allows card issuers to simultaneously:
increase the proportion of applications received
reduce the proportion of fraudulent applications
authorized
Quality Control
IBM Microelectronics: Quality Control
Analyzed manufacturing data on Dynamic Random Access Memory
(DRAM) chips.
Data mining
1. Built predictive models of:
manufacturing yield (% non-defective)
effects of production parameters on chip performance.
2. Discovered critical factors behind
production yield &
product performance.
3. Created a new design for the chip:
increased yield, saving millions of dollars in direct
manufacturing costs
enhanced product performance by substantially lowering the
memory cycle time
Retail Sales
B & L Stores
Belk and Leggett Stores: one of the largest retail chains
280 stores in southeast U.S.
data warehouse contains 100s of gigabytes (billions of
characters) of data
data mining to:
increase sales
reduce costs
Selected DSS Agent from MicroStrategy, Inc.
analyze merchandising (patterns of sales)
manage inventory
Market Basket Analysis
DSS Agent
uses intelligent agents for data mining
provides multiple functions
recognizes sales patterns among stores
discovers sales patterns by
time of day
day of year
category of product
etc.
swiftly identifies trends & shifts in customer tastes
performs Market Basket Analysis (MBA)
analyzes Point-of-Sale or -Service (POS) data
identifies relationships among products and/or services purchased
E.g. A customer who buys Brand X slacks has a 35% chance of
buying Brand Y shirts.
Agent tool is also used by other Fortune 1000 firms
average ROI > 300 %
Case Based Reasoning (CBR)
[Diagram: general scheme for a case-based reasoning (CBR) model; the target case is matched against similar precedents in the historical database, such as cases A and B]
Case Based Reasoning (CBR)
Learning through the accumulation of experience
Key issues
Indexing:
storing cases for quick, effective access of precedents
Retrieval:
accessing the appropriate precedent cases
Advantages
Explicit knowledge form recognizable to humans
No need to re-code knowledge for computer processing
Limitations
Retrieving precedents based on superficial features
E.g. Matching Indonesia with U.S. because both have similar population size
Traditional approach ignores the issue of generalizing knowledge
Genetic Algorithm
Generation of candidate solutions using the procedures of biological
evolution.
Procedure
0. Initialize.
Create a population of potential solutions ("organisms").
1. Evaluate.
Determine the level of "fitness" for each solution.
2. Cull.
Discard the poor solutions.
3. Breed.
a. Select 2 "fit" solutions to serve as parents.
b. From the 2 parents, generate offspring.
* Crossover:
Cut the parents at random and switch the 2 halves.
* Mutation:
Randomly change the value in a parent solution.
4. Repeat.
Go back to Step 1 above.
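A minimal Python sketch of the procedure above for solutions encoded as bit lists; the population size, selection rule, and mutation rate are illustrative assumptions:

```python
import random

def genetic_algorithm(fitness, random_solution, generations=100,
                      pop_size=50, mutation_rate=0.05):
    """Sketch of the procedure above for bit-list solutions.

    fitness         -- function: solution -> numeric fitness (higher = better)
    random_solution -- function: () -> a random candidate solution
    """
    # 0. Initialize: create a population of potential solutions
    population = [random_solution() for _ in range(pop_size)]
    for _ in range(generations):
        # 1. Evaluate and 2. Cull: keep the fitter half
        population.sort(key=fitness, reverse=True)
        population = population[:pop_size // 2]
        # 3. Breed until the population is back to full size
        while len(population) < pop_size:
            mom, dad = random.sample(population[:10], 2)  # select 2 fit parents
            cut = random.randrange(1, len(mom))           # crossover point
            child = mom[:cut] + dad[cut:]
            for i in range(len(child)):                   # mutation
                if random.random() < mutation_rate:
                    child[i] = random.choice((0, 1))
            population.append(child)
    return max(population, key=fitness)                   # best solution found

# Toy usage: maximize the number of 1s in a 20-bit string
best = genetic_algorithm(sum, lambda: [random.randint(0, 1) for _ in range(20)])
print(best, sum(best))
```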
Genetic Algorithm (Cont.)
Advantages
Applicable to a wide range of problem domains.
Robustness:
can obtain solutions even when the performance
function is highly irregular or input data are noisy.
Implicit parallelism:
can search in many directions concurrently.
Limitations
Slow, like neural networks.
But: computation can be distributed
over multiple processors
(unlike neural networks)
Source: www.pathology.washington.edu
Multistrategy Learning
Every technique has advantages & limitations
Multistrategy approach
Take advantage of the strengths of diverse techniques
Circumvent the limitations of each methodology
Types of Models
Prediction Models for Predicting and Classifying:
Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)
Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)
Descriptive Models for Grouping and Finding Associations:
Clustering/Grouping algorithms: K-means, Kohonen
Association algorithms: apriori, GRI
Neural Networks
Description
Difficult interpretation
Tends to ‘overfit’ the data
Extensive amount of training time
A lot of data preparation
Works with all data types
Rule Induction
Description
Intuitive output
Handles all forms of numeric data,
as well as non-numeric (symbolic)
data
C5 Algorithm a special case of rule
induction
Apriori
Description
Seeks association rules
in dataset
‘Market basket’ analysis
Sequence discovery
Data Mining Is
The automated process of finding
relationships and patterns in stored
data
It is different from the use of SQL
queries and other business
intelligence tools
Data Mining Is
Motivated by business need, large
amounts of available data, and
humans’ limited cognitive processing
abilities
Enabled by data warehousing,
parallel processing, and data mining
algorithms
Common Types of Information
from Data Mining
Associations -- identifies occurrences
that are linked to a single event
Sequences -- identifies events that
are linked over time
Classification -- recognizes patterns
that describe the group to which an
item belongs
Common Types of Information
from Data Mining
Clustering -- discovers different
groupings within the data
Forecasting -- estimates future
values
Commonly Used Data Mining
Techniques
Artificial neural networks
Decision trees
Genetic algorithms
Nearest neighbor method
Rule induction
The Current State of Data Mining
Tools
Many of the vendors are small companies
IBM and SAS have been in the market for
some time, and more “biggies” are
moving into this market
BI tools and RDBMS products are
increasingly including basic data mining
capabilities
Packaged data mining applications are
becoming common
The Data Mining Process
Requires personnel with domain,
data warehousing, and data mining
expertise
Requires data selection, data
extraction, data cleansing, and data
transformation
Most data mining tools work with
highly granular flat files
Is an iterative and interactive
process
Why Data Mining
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are
the least likely to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection
Which types of transactions are likely to be fraudulent,
given the demographics and transactional history of a
particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal,
and which are most likely to leave for a competitor?
Data Mining helps extract such
information
Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications,
financial transactions
from an online stream of events, identify fraudulent
events
Manufacturing and production:
automatically adjust knobs when process parameter
changes
Applications (continued)
Medicine: disease outcome, effectiveness
of treatments
analyze patient disease history: find
relationship between diseases
Molecular/Pharmaceutical: identify new
drugs
Scientific data analysis:
identify new galaxies by searching for sub
clusters
Web site/store design and promotion:
find affinity of visitors to pages and modify the site layout accordingly
The KDD process
Problem formulation
Data collection
subset data: sampling might hurt if highly skewed data
feature selection: principal component analysis,
heuristic search
Pre-processing: cleaning
name/address cleaning, different meanings (annual,
yearly), duplicate removal, supplying missing values
Transformation:
map complex objects e.g. time series data to features
e.g. frequency
Choosing mining task and mining method:
Result evaluation and Visualization:
Knowledge discovery is an iterative process
Relationship with other fields
Overlaps with machine learning, statistics,
artificial intelligence, databases,
visualization but more stress on
scalability of number of features and instances
stress on algorithms and architectures
whereas foundations of methods and
formulations provided by statistics and
machine learning.
automation for handling large, heterogeneous
data
Some basic operations
Predictive:
Regression
Classification
Collaborative Filtering
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
Classification
Given old data about customers and payments, predict a new applicant’s loan eligibility.
[Diagram: previous customers (age, salary, profession, location, customer type) train a classifier; the classifier yields decision rules such as “Salary > 5 L” and “Prof. = Exec” → good/bad, which are then applied to a new applicant’s data]
Classification methods
Goal: Predict class Ci = f(x1, x2, .. Xn)
Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci
Nearest neighbour
Decision tree classifier: divide decision space into piecewise constant regions
Probabilistic/generative models
Neural networks: partition by non-linear boundaries
Nearest neighbor
Define proximity between instances, find neighbors of new instance and assign majority class
Case based reasoning: when attributes are more complicated than real-valued
Pros:
+ Fast training
Cons:
– Slow during application
– No feature selection
– Notion of proximity vague
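A minimal Python sketch of the majority-vote nearest neighbor classifier described above; note that "training" is just storing the examples, which is why application is the slow part:

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """training: list of (vector, label) pairs; query: vector.
    Assign the majority class among the k nearest neighbors."""
    neighbors = sorted(training, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage: two blobs in the plane
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_classify(train, (0.5, 0.5)))   # -> "a"
print(knn_classify(train, (5.5, 5.5)))   # -> "b"
```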
Clustering
Unsupervised learning when old data with
class labels not available e.g. when
introducing a new product.
Group/cluster existing customers based on
time series of payment history such that
similar customers in same cluster.
Key requirement: Need a good measure of
similarity between instances.
Identify micro-markets and develop
policies for each
Applications
Customer segmentation e.g. for targeted
marketing
Group/cluster existing customers based on
time series of payment history such that
similar customers in same cluster.
Identify micro-markets and develop policies
for each
Collaborative filtering:
group based on common items purchased
Text clustering
Compression
Distance functions
Numeric data: Euclidean, Manhattan distances
Categorical data: 0/1 to indicate presence/absence, followed by
Hamming distance (# of dissimilarities)
Jaccard coefficient: # of matching 1s / # of positions where either is 1
Data dependent measures: similarity of A and B depends on co-occurrence with C
Combined numeric and categorical data: weighted normalized distance
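Direct Python translations of these distance and similarity functions for vectors of equal length (the 0/1 vectors encode presence/absence of categorical values):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions that differ, for 0/1 presence/absence vectors."""
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    """Matching 1s / positions where either vector has a 1."""
    ones_both = sum(a == b == 1 for a, b in zip(x, y))
    ones_any = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return ones_both / ones_any if ones_any else 1.0

print(euclidean((0, 0), (3, 4)))             # 5.0
print(manhattan((0, 0), (3, 4)))             # 7
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))   # 2
print(jaccard((1, 0, 1, 1), (1, 1, 0, 1)))   # 0.5
```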
Clustering methods
Hierarchical clustering
agglomerative vs. divisive
single link vs. complete link
Partitional clustering
distance-based: K-means
model-based: EM
density-based:
Partitional methods: K-means
Criteria: minimize sum of square of
distance
Between each point and centroid of the
cluster.
Between each pair of points in the
cluster
Algorithm:
Select initial partition with K clusters:
random, first K, K separated points
Repeat until stabilization:
Assign each point to the closest cluster center
Re-compute the cluster centers
Collaborative Filtering
Given database of user preferences,
predict preference of new user
Example: predict what new movies you will
like based on
your past preferences
others with similar past preferences
their preferences for the new movies
Example: predict what books/CDs a person
may want to buy
(and suggest it, or give discounts to
tempt customer)
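A minimal sketch of user-based collaborative filtering in Python; cosine similarity over co-rated items is one common similarity choice (real systems usually mean-center ratings first), and the ratings dictionary is an illustrative toy example:

```python
import math

def cosine(u, v):
    """Similarity of two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    peers = [(cosine(ratings[user], ratings[p]), ratings[p][item])
             for p in ratings if p != user and item in ratings[p]]
    total = sum(s for s, _ in peers)
    return sum(s * r for s, r in peers) / total if total else None

# Toy usage: predict whether "ann" will like "MovieC"
ratings = {"ann": {"MovieA": 5, "MovieB": 4},
           "bob": {"MovieA": 5, "MovieB": 4, "MovieC": 5},
           "cat": {"MovieA": 1, "MovieB": 2, "MovieC": 1}}
print(predict(ratings, "ann", "MovieC"))
```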
Association rules
Given set T of groups of items
Example: set of itemsets purchased
T = { {Milk, cereal}, {Tea, milk}, {Tea, rice, bread} }
Goal: find all rules on itemsets of the form a → b such that:
support of a and b > user threshold s
conditional probability (confidence) of b given a > user threshold c
Example: Milk → cereal (purchase of product A → purchase of product B)
Prevalent vs. Interesting
Analysts already know about prevalent rules
Interesting rules are those that deviate from prior expectation
Mining’s payoff is in finding surprising phenomena
[Cartoon: “Milk and cereal sell together!” is news in 1995, but draws a “Zzzz...” when rediscovered in 1998]
Applications of fast itemset
counting
Find correlated events:
Applications in medicine: find
redundant tests
Cross selling in retail, banking
Improve predictive capability of
classifiers that assume attribute
independence
New similarity measures of
categorical attributes [Mannila et al.]
Application Areas
Industry               | Application
Finance                | Credit Card Analysis
Insurance              | Claims, Fraud Analysis
Telecommunication      | Call record analysis
Transport              | Logistics management
Consumer goods         | Promotion analysis
Data Service providers | Value added data
Utilities              | Power usage analysis
Usage scenarios
Data warehouse mining:
assimilate data from operational sources
mine static data
Mining log data
Continuous mining: example in process
control
Stages in mining:
data selection → pre-processing (cleaning) → transformation → mining → result evaluation → visualization
Mining market
Around 20 to 30 mining tool vendors
Major tool players:
Clementine,
IBM’s Intelligent Miner,
SGI’s MineSet,
SAS’s Enterprise Miner.
All pretty much the same set of tools
Many embedded products:
fraud detection:
electronic commerce applications,
health care,
customer relationship management: Epiphany
Vertical integration:
Mining on the web
Web log analysis for site design:
what are popular pages,
what links are hard to find.
Electronic stores sales enhancements:
recommendations, advertisement:
Collaborative filtering: Net perception,
Wisewire
Inventory control: what was a shopper
looking for and could not find?
State of the art in mining-OLAP integration
Decision trees [Information discovery,
Cognos]
find factors influencing high profits
Clustering [Pilot software]
segment customers to define hierarchy on that
dimension
Time series analysis [Seagate's Holos]
Query for various shapes along time, e.g. spikes, outliers
Multi-level Associations [Han et al.]
find association between members of dimensions
Data Mining in Use
The US Government uses Data Mining to
track fraud
A Supermarket becomes an information
broker
Basketball teams use it to track game
strategy
Cross Selling
Target Marketing
Holding on to Good Customers
Weeding out Bad Customers
Some success stories
Network intrusion detection using a combination
of sequential rule discovery and classification
tree on 4 GB DARPA data
Won over (manual) knowledge engineering approach
http://www.cs.columbia.edu/~sal/JAM/PROJECT/
provides good detailed description of the entire process
Major US bank: customer attrition prediction
First segment customers based on financial behavior:
found 3 segments
Build attrition models for each of the 3 segments
40-50% of attritions were predicted, a factor of 18 increase
Targeted credit marketing: major US banks
find customer segments based on 13 months of credit balances
What is KnowledgeSeeker?
Produced by ANGOSS Software Corporation,
who focus “solely” on data mining software.
Offer training and consulting services
Produce data mining add-ins which accept
data from all major databases
Works with popular query and reporting,
spreadsheet, statistical and OLAP & ROLAP
tools.
199
Major Competitors
Company: Software
SPSS: Clementine 6.0
SAS: Enterprise Miner 3.0
IBM: Intelligent Miner
200
Major Competitors
Company: Software
SGI: MineSet 3.1
Oracle: Darwin
Cognos: Scenario
201
Current Applications
Manufacturing
Used by the R.R. Donnelley & Sons commercial
printing company to improve process control, cut
costs and increase productivity.
Used extensively by Hewlett Packard in their
United States manufacturing plants as a process
control tool, both to analyze factors impacting
product quality and to generate rules for
production control systems.
202
Current Applications
Auditing
Used by the IRS to combat fraud,
reduce risk, and increase collection
rates.
Finance
Used by the Canadian Imperial Bank
of Commerce (CIBC) to create
models for fraud detection and risk
management.
203
Current Applications
CRM
Telephony
Used by US West to reduce churning and
increase customer loyalty for a new voice
messaging technology.
204
Current Applications
Marketing
Used by the Washington Post to
improve their direct mail targeting
and to conduct survey analysis.
Health Care
Used by the Oxford Transplant
Center to discover factors affecting
transplant survival rates.
Used by the University of Rochester
Cancer Center to study the effect of
anxiety on chemotherapy-related
nausea.
205
More Customers
206
Questions
1.
What percentage of people in the test group have high blood pressure
with these characteristics: a 66-year-old male regular smoker with
low to moderate salt consumption?
2.
Do the risk levels change for a male with the same characteristics who
quit smoking? What are the percentages?
3.
If you are a 2% milk drinker, how many factors are still interesting?
4.
Knowing that salt consumption and smoking habits are interesting
factors, which one has a stronger correlation to blood pressure levels?
5.
Grow an automatic tree. Check whether gender is an interesting factor
for a 55-year-old regular smoker who does not eat cheese.
207
Association
Classic market-basket analysis, which treats the
purchase of a number of items (for example, the
contents of a shopping basket) as a single transaction.
This information can be used to adjust inventories,
modify floor or shelf layouts, or introduce targeted
promotional activities to increase overall sales or
move specific products.
Example : 80 percent of all transactions in which
beer was purchased also included potato chips.
Sequence-based analysis
Traditional market-basket analysis deals with
a collection of items as part of a point-in-time
transaction.
Sequence-based analysis instead looks across
transactions over time, to identify a typical set
of purchases that might predict the subsequent
purchase of a specific item.
Clustering
Clustering approaches address segmentation
problems.
These approaches assign records with a large
number of attributes into a relatively small set of
groups or "segments."
Example : Buying habits of multiple population
segments might be compared to determine which
segments to target for a new sales campaign.
Classification
Most commonly applied data mining
technique
Algorithm uses preclassified examples to
determine the set of parameters required for
proper discrimination.
Example : A classifier derived from the
classification approach is capable of
identifying risky loans and could be used to aid
in the decision of whether to grant a loan to an
individual.
Issues of Data Mining
Present-day tools are strong but require
significant expertise to implement effectively.
Issues of Data Mining
Susceptibility to "dirty" or irrelevant data.
Inability to "explain" results in human terms.
Issues
susceptibility to "dirty" or irrelevant data
Data mining tools of today simply take everything
they are given as factual and draw the resulting
conclusions.
Users must take the necessary precautions to
ensure that the data being analyzed is "clean."
Issues, cont’
inability to "explain" results in human terms
Many of the tools employed in data mining
analysis use complex mathematical algorithms that
are not easily mapped into human terms.
what good does the information do if you don’t
understand it?
Comparison with reporting, BI and OLAP
Data Mining: complex relationships; automatically find the relevant factors; show only relevant details; prediction...
Reporting: simple relationships; choose the relevant factors; examine all details.
(Also applies to visualisation & simple statistics)
Comparison with Statistics
Statistical analysis: mainly about hypothesis testing; focused on precision.
Data mining: mainly about hypothesis generation; focused on deployment.
Example: data mining and customer
processes
Insight: Who are my customers and
why do they behave the way they
do?
Prediction: Who is a good prospect,
for what product, who is at risk,
what is the next thing to offer?
Uses: Targeted marketing, mailshots, call-centres, adaptive websites
Example: data mining and fraud
detection
Insight: How can (specific
method of) fraud be
recognised? What constitute
normal, abnormal and
suspicious events?
Prediction: Recognise
similarity to previous frauds –
how similar?
Spot abnormal events – how
suspicious?
Example: data mining and
diagnosing cancer
Complex data from genetics
Challenging data mining problem
Find patterns of gene activation
indicating different diseases / stages
“Changed the way I think about
cancer” Oncologist from Chicago Children’s
Memorial Hospital
Example: data mining and policing
Knowing the patterns helps plan
effective crime prevention
Crime hot-spots understood better
Sift through mountains of crime
reports
Identify crime series
“Other people save money using
data mining – we save lives.” Police
force homicide specialist and data miner
Data mining tools:
Clementine and its philosophy
How to do data mining
Lots of data mining operations
How do you glue them together to
solve a problem?
How do we actually do data mining?
Methodology
Not just the right way, but any way…
Myths about Data Mining (1)
Data, Process and Tech
Myth: data mining is all about massive data.
Reality: it can be, but some important datasets are very small, and sampling is often appropriate.
Myth: data mining is a technical process.
Reality: business analysts perform data mining every day; it is a business process.
Myth: data mining is all about algorithms.
Reality: algorithms are a key tool, but data mining is done by people, not by algorithms.
Myth: data mining is all about predictive accuracy.
Reality: it's about usefulness; accuracy is only a small component.
Myths about Data Mining (2)
Data Quality
Myth: data mining only works with clean data.
Reality: cleaning the data is part of the data mining process; it need not be clean initially.
Myth: data mining only works with complete data.
Reality: data mining works with whatever data you have; complete is good, incomplete is also OK.
Myth: data mining only works with correct data.
Reality: errors in data are inevitable; data mining helps you deal with them.
One last exploding myth
Myth: neural networks are not useful when you
need to understand the patterns that you find
(which is nearly always the case in data mining).
Reality: this myth is related to over-simplistic views of data mining.
Data mining techniques form a toolkit
We often use techniques in surprising ways
E.g. Neural nets for field selection
Neural nets for pattern confirmation
Neural nets combined with other techniques
for cross-checking
What use is a pair of pliers?
Related Concepts Outline
Goal: Examine some areas which are related to data
mining.
Database/OLTP Systems
Fuzzy Sets and Logic
Information Retrieval (Web Search Engines)
Dimensional Modeling
Data Warehousing
OLAP/DSS
Statistics
Machine Learning
Pattern Matching
226
Fuzzy Sets and Logic
Fuzzy Set: Set membership function is a real
valued function with output in the range [0,1].
f(x): Probability x is in F.
1-f(x): Probability x is not in F.
EX:
T = {x | x is a person and x is tall}
Let f(x) be the probability that x is tall
Here f is the membership function
DM: Prediction and classification are
fuzzy.
227
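A sketch of such a membership function for "tall" (the 150 cm and 190 cm breakpoints are illustrative assumptions):

    def f_tall(height_cm):
        # membership rises linearly from 0 at 150 cm to 1 at 190 cm
        if height_cm <= 150:
            return 0.0
        if height_cm >= 190:
            return 1.0
        return (height_cm - 150) / 40.0

    for h in (140, 170, 200):
        print(h, f_tall(h))  # 0.0, 0.5, 1.0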
Information Retrieval
Information Retrieval (IR): retrieving desired
information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
Find all documents about “data mining”.
DM: Similarity measures;
Mine text/Web data.
228
Dimensional Modeling
View data in a hierarchical manner, more
as business executives might
Useful in decision support systems and
mining
Dimension: collection of logically
related attributes; axis for modeling
data.
Facts: data stored
Ex: Dimensions – products, locations,
date
Facts – quantity, unit price
DM: May view data as dimensional.
229
Dimensional Modeling Queries
Roll Up: more general dimension
Drill Down: more specific
dimension
Dimension (Aggregation) Hierarchy
SQL uses aggregation
Decision Support Systems
(DSS): Computer systems and
tools to assist managers in making
decisions and solving problems.
230
Cube view of Data
231
Data Warehousing
“Subject-oriented, integrated, time-variant,
nonvolatile” William Inmon
Operational Data: Data used in day to day
needs of company.
Informational Data: Supports other functions
such as planning and forecasting.
Data mining tools often access data warehouses
rather than operational data.
DM: May access data in
warehouse.
232
OLAP
Online Analytic Processing (OLAP): provides
more complex queries than OLTP.
OnLine Transaction Processing (OLTP):
traditional database/transaction processing.
Dimensional data; cube view
Visualization of operations:
Slice: examine sub-cube.
Dice: rotate cube to look at another dimension.
Roll Up/Drill Down
DM: May use OLAP queries.
233
OLAP Operations
Roll Up
Drill Down
Single Cell
Multiple Cells
Slice
Dice
234
Statistics
Simple descriptive models
Statistical inference: generalizing a
model created from a sample of the data
to the entire dataset.
Exploratory Data Analysis:
Data can actually drive the creation of
the model
Opposite of traditional statistical view.
Data mining targeted to business user
DM: Many data mining methods
come from statistical
techniques.
235
Machine Learning
Machine Learning: area of AI that examines
how to write programs that can learn.
Often used in classification and prediction
Supervised Learning: learns by example.
Unsupervised Learning: learns without
knowledge of correct answers.
Machine learning often deals with small static
datasets.
DM: Uses many machine learning
techniques.
236
Pattern Matching (Recognition)
Pattern Matching: finds
occurrences of a predefined pattern
in the data.
Applications include speech
recognition, information retrieval,
time series analysis.
DM: Type of classification.
237
DM vs. Related Topics

Area     Query     Data              Results  Output
DB/OLTP  Precise   Database          Precise  DB objects or aggregation
IR       Precise   Documents         Vague    Documents
OLAP     Analysis  Multidimensional  Precise  DB objects or aggregation
DM       Vague     Preprocessed      Vague    KDD objects

238
Data Mining Techniques Outline
Goal: Provide an overview of basic data
mining techniques
Statistical
Point Estimation
Models Based on Summarization
Bayes Theorem
Hypothesis Testing
Regression and Correlation
Similarity Measures
Decision Trees
Neural Networks
Activation Functions
Genetic Algorithms
239
Point Estimation
Point Estimate: estimate a population
parameter.
May be made by calculating the
parameter for a sample.
May be used to predict value for missing
data.
Ex:
R contains 100 employees
99 have salary information
Mean salary of these is $50,000
Use $50,000 as value of remaining
employee’s salary.
Is this a good idea?
240
Estimation Error
Bias: difference between the expected value
and the actual value: Bias = E[θ̂] - θ
Mean Squared Error (MSE): expected value of
the squared difference between the estimate
and the actual value: MSE = E[(θ̂ - θ)^2]
Why square? Positive and negative errors cannot
cancel, and larger errors are penalized more.
Root Mean Square Error (RMSE): the square root
of the MSE, on the same scale as the data
241
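A small numeric sketch of bias, MSE and RMSE (the actual and estimated values are invented for illustration):

    import numpy as np

    actual = np.array([10.0, 12.0, 9.0, 11.0])    # true values, illustrative
    estimate = np.array([11.0, 11.5, 8.0, 12.0])  # estimates, illustrative

    bias = np.mean(estimate - actual)             # average signed error
    mse = np.mean((estimate - actual) ** 2)       # squaring stops cancellation
    rmse = np.sqrt(mse)                           # back on the original scale
    print(bias, mse, rmse)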
Expectation-Maximization (EM)
Solves estimation with incomplete
data.
Obtain initial estimates for
parameters.
Iteratively use estimates for
missing data and continue until
convergence.
242
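As one concrete instance, a minimal EM sketch for a two-component Gaussian mixture in one dimension: each iteration uses the current estimates to softly assign points to components (the "missing" labels), then re-estimates the parameters (data and initial values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

    mu = np.array([-1.0, 1.0])   # initial estimates for the two means
    sigma = np.array([1.0, 1.0])
    pi = np.array([0.5, 0.5])    # mixing weights

    for _ in range(50):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        n = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n)
        pi = n / len(x)

    print(mu, sigma, pi)  # the means should approach 0 and 5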
Models Based on Summarization
Visualization: Frequency distribution, mean,
variance, median, mode, etc.
Box Plot: graphical five-number summary (minimum, lower quartile, median, upper quartile, maximum)
243
Bayes Theorem
Posterior Probability: P(h1 | xi)
Prior Probability: P(h1)
Bayes Theorem: P(h1 | xi) = P(xi | h1) P(h1) / P(xi)
Assign probabilities of hypotheses given a
data value.
244
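A toy numeric application of the theorem (all probabilities are invented for illustration):

    # hypothesis h: "message is spam"; data value x: message contains "free"
    p_h = 0.2              # prior P(h)
    p_x_given_h = 0.7      # likelihood P(x | h)
    p_x_given_not_h = 0.1  # P(x | not h)

    # total probability of seeing x at all
    p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)
    posterior = p_x_given_h * p_h / p_x  # Bayes theorem: P(h | x)
    print(posterior)  # about 0.64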
Hypothesis Testing
Find model to explain behavior by
creating and then testing a
hypothesis about the data.
Exact opposite of usual DM
approach.
H0 – Null hypothesis; Hypothesis to
be tested.
H1 – Alternative hypothesis
245
Regression
Predict future values based on past
values
Linear Regression assumes linear
relationship exists.
y = c0 + c1 x1 + … + cn xn
Find values to best fit the data
246
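A least-squares fit of the coefficients on toy data might be sketched as follows:

    import numpy as np

    # toy data, illustrative: y depends roughly linearly on x1
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

    X = np.column_stack([np.ones_like(x1), x1])     # columns for c0 and c1
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimize squared error
    c0, c1 = coeffs
    print(c0, c1)         # fitted intercept and slope
    print(c0 + c1 * 6.0)  # predict a future value at x1 = 6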
Correlation
Examine the degree to which the
values for two variables behave
similarly.
Correlation coefficient r = cov(x, y) / (std(x) * std(y)):
• 1 = perfect correlation
• -1 = perfect but opposite correlation
• 0 = no correlation
247
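Computing r on toy data (values are illustrative):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])

    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
    print(r)                     # close to 1: strong positive correlation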
Similarity Measures
Determine the similarity between two objects.
Similarity characteristics
Alternatively, distance measures measure
how unlike or dissimilar objects are.
248
Distance Measures
Measure dissimilarity between
objects
249
Decision Trees
Decision Tree (DT):
Tree where the root and each internal
node is labeled with a question.
The arcs represent each possible answer
to the associated question.
Each leaf node represents a prediction
of a solution to the problem.
Popular technique for classification;
Leaf node indicates class to which the
corresponding tuple belongs.
250
Decision Trees
A Decision Tree Model is a
computational model consisting of three
parts:
Decision Tree
Algorithm to create the tree
Algorithm that applies the tree to data
Creation of the tree is the most difficult
part.
Processing is basically a search similar
to that in a binary search tree (although
a DT may not be binary).
251
Neural Networks
Based on observed functioning of
human brain.
(Artificial Neural Networks, ANN)
Our view of neural networks is very
simplistic.
We view a neural network (NN)
from a graphical viewpoint.
Alternatively, a NN may be viewed
from the perspective of matrices.
Used in pattern recognition, speech
recognition, computer vision, and
classification.
252
Generating Rules
Decision tree can be converted into a
rule set
Straightforward conversion:
each path to the leaf becomes a rule –
makes an overly complex rule set
More effective conversions are not
trivial
(e.g. C4.8 tests each node in root-leaf
path to see if it can be eliminated
without loss in accuracy)
253
Covering algorithms
Strategy for generating a rule set
directly: for each class in turn find
rule set that covers all instances in it
(excluding instances not in the class)
This approach is called a covering
approach because at each stage a
rule is identified that covers some of
the instances
254
Rules vs. trees
Corresponding decision tree:
(produces exactly the same
predictions)
But: rule sets can be more clear
when decision trees suffer from
replicated subtrees
Also: in multi-class situations, a
covering algorithm concentrates on
one class at a time, whereas a decision
tree learner takes all classes into account
255
A simple covering algorithm
Generates a rule by adding tests that
maximize the rule's accuracy
Similar to the situation in decision trees:
the problem of selecting an attribute to
split on
But: a decision tree inducer maximizes
overall purity
Each new test reduces the rule's coverage
(Figure: the space of examples, the rule so far,
and the rule after adding a new term)
256
Algorithm Components
1. The task the algorithm is used to address (e.g.
classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to
the data (e.g. a linear regression model)
3. The score function used to judge the quality of the
fitted models or patterns (e.g. accuracy, BIC, etc.)
4. The search or optimization method used to search over
parameters and/or structures (e.g. steepest descent,
MCMC, etc.)
5. The data management technique used for storing,
indexing, and retrieving data (critical when data too large
to reside in memory)
Models and Patterns
Models
Prediction:
•Linear regression
•Piecewise linear
•Nonparametric regression
•Classification: logistic regression,
naïve Bayes/TAN/Bayesian networks,
neural networks, support vector machines,
classification trees, etc.
Probability Distributions:
•Parametric models
•Mixtures of parametric models
•Graphical Markov models (categorical,
continuous, mixed)
Structured Data:
•Time series
•Markov models
•Mixture Transition Distribution models
•Hidden Markov models
•Spatial models
Bias-Variance Tradeoff
High bias - low variance at one extreme
Low bias - high variance at the other:
"overfitting" - modeling the random component
The score function should embody the compromise
Patterns
Global: clustering via partitioning,
hierarchical clustering, mixture models
Local: outlier detection, changepoint
detection, bump hunting, scan statistics,
association rules
Scan Statistics via Permutation Tests
(Figure: accidents marked as "x" along a curve representing a road.)
The curve represents a road
Each "x" marks an accident
Red "x" denotes an injury accident
Black "x" means no injury
Is there a stretch of road where there is an unusually large
fraction of injury accidents?
Scan with Fixed Window
If we know the length of the "stretch
of road" that we seek, we could slide a
window of that length along the road
and find the most "unusual" window
location
(Figure: a fixed-length window sliding along the road of accident marks.)
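A sketch of the fixed-window scan combined with the permutation test from the previous slide (road length, window width and accident data are all illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    pos = np.sort(rng.uniform(0, 10, 60))  # accident positions along the road
    injury = rng.random(60) < 0.25         # which accidents involved injury
    injury[(pos > 4) & (pos < 5)] = True   # plant a genuine hot stretch

    def max_window_count(pos, flags, width=1.0):
        # slide a fixed-width window along the road; return the largest
        # number of injury accidents seen in any window position
        return max(np.sum(flags[(pos >= a) & (pos < a + width)]) for a in pos)

    observed = max_window_count(pos, injury)
    # permutation test: shuffle the injury labels to see how large the
    # maximum window count gets by chance alone
    null = [max_window_count(pos, rng.permutation(injury)) for _ in range(500)]
    p_value = np.mean([n >= observed for n in null])
    print(observed, p_value)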
Spatial-Temporal Scan Statistics
Spatial-temporal scan statistics use
cylinders, where the height of the cylinder
represents a time window
Major Data Mining Tasks
Classification: predicting an item
class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur
frequently
Visualization: to facilitate human
discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous
value
Link Analysis: finding relationships
270
Classification
Learn a method for predicting the instance class from
pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
271
Clustering
Find “natural” grouping of
instances given un-labeled data
272
Association Rules &
Frequent Itemsets

Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (3)
Milk, Bread, Cereal (2)
...

Rules:
Milk => Bread (66%)
273
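A sketch of counting frequent pairs over exactly these transactions (the minimum support count of 3 is an illustrative choice):

    from itertools import combinations
    from collections import Counter

    transactions = [
        {"MILK", "BREAD", "EGGS"}, {"BREAD", "SUGAR"}, {"BREAD", "CEREAL"},
        {"MILK", "BREAD", "SUGAR"}, {"MILK", "CEREAL"}, {"BREAD", "CEREAL"},
        {"MILK", "CEREAL"}, {"MILK", "BREAD", "CEREAL", "EGGS"},
        {"MILK", "BREAD", "CEREAL"},
    ]

    pair_counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            pair_counts[pair] += 1

    minsup = 3  # minimum support count, illustrative
    frequent = {p: c for p, c in pair_counts.items() if c >= minsup}
    print(frequent)  # e.g. ('BREAD', 'MILK') occurs in 4 transactions

    # confidence of Milk => Bread: support(Milk, Bread) / support(Milk)
    milk = sum("MILK" in t for t in transactions)
    print(frequent[("BREAD", "MILK")] / milk)  # 4 / 6 = 66%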
Visualization & Data Mining
Visualizing the
data to facilitate
human discovery
Presenting the
discovered
results in a
visually "nice"
way
274
Summarization
Describe features of the
selected group
Use natural language
and graphics
Usually in Combination
with Deviation detection
or other methods
Average length of stay in this study area rose 45.7 percent,
from 4.3 days to 6.2 days, because ...
275
Data Mining Central Quest
Find true patterns
and avoid overfitting
(finding seemingly significant
but really random patterns due
to searching too many possibilities)
276
Classification
Learn a method for predicting the instance class from
pre-labeled (classified) instances
Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...
Given a set of points from known classes,
what is the class of a new point?
277
Classification: Linear Regression
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes
wi from data to
minimize squared error
to ‘fit’ the data
Not flexible enough
278
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
(Figure: the corresponding decision regions in the X-Y plane,
with boundaries at X = 2, X = 5, and Y = 3.)
279
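The same tree, transcribed directly into code:

    def classify(x, y):
        # nested tests mirror the root-to-leaf paths of the tree above
        if x > 5:
            return "blue"
        elif y > 3:
            return "blue"
        elif x > 2:
            return "green"
        else:
            return "blue"

    print(classify(6, 1), classify(3, 1), classify(1, 1))  # blue green blue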
DECISION TREE
An internal node is a test on an
attribute.
A branch represents an outcome of
the test, e.g., Color=red.
A leaf node represents a class label
or class label distribution.
At each node, one attribute is chosen
to split training examples into distinct
classes as much as possible
A new instance is classified by following
the branches matching its attribute values
from the root down to a leaf.
280
Classification: Neural Nets
Can select more
complex regions
Can be more accurate
Also can overfit the
data – find patterns in
random noise
281
Evaluating which method works
the best for classification
No model is uniformly the best
Dimensions for Comparison
speed of training
speed of model application
noise tolerance
explanation ability
Best Results: Hybrid, Integrated
models
282
Comparison of Major Classification Approaches

Approach         Train time  Run time  Noise tolerance  Can use prior knowledge  Accuracy on customer modelling  Understandable
Decision Trees   fast        fast      poor             no                       medium                          medium
Rules            med         fast      poor             no                       medium                          good
Neural Networks  slow        fast      good             no                       good                            poor
Bayesian         slow        fast      good             yes                      good                            good

A hybrid method will have higher accuracy
283
Evaluation of Classification
Models
How predictive is the model we
learned?
Error on the training data is not a
good indicator of performance on
future data
The new data will probably not be
exactly the same as the training data!
Overfitting – fitting the training data
too precisely - usually leads to poor
results on new data
284
Classification:
Train, Validation, Test split
(Diagram: data with known results is split into a training set,
a validation set, and a final test set. The model builder fits
models on the training set; candidate models are evaluated and
tuned on the validation set; the final model makes predictions
on the final test set for a one-time final evaluation.)
285
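A minimal sketch of such a split (the 60/20/20 proportions are an illustrative assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000                   # size of a labeled dataset, illustrative
    idx = rng.permutation(n)   # shuffle before splitting

    train_idx = idx[:600]      # fit candidate models
    val_idx = idx[600:800]     # compare and tune the models
    test_idx = idx[800:]       # final, one-shot evaluation only
    print(len(train_idx), len(val_idx), len(test_idx))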
Cross-validation
Cross-validation avoids overlapping
test sets
First step: data is split into k subsets of
equal size
Second step: each subset in turn is
used for testing and the remainder for
training
This is called k-fold cross-validation
Often the subsets are stratified
before the cross-validation is
performed
286
Cross-validation example:
Break up the data into groups of the same size
Hold aside one group for testing and use the rest to build the model
Repeat for each group in turn
287
More on cross-validation
Standard method for evaluation:
stratified ten-fold cross-validation
Why ten? Extensive experiments
have shown that this is the best
choice to get an accurate estimate
Stratification reduces the estimate’s
variance
Even better: repeated stratified
cross-validation
E.g. ten-fold cross-validation is repeated
ten times and the results are averaged
288
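A plain k-fold cross-validation harness might be sketched as follows (stratification and repetition are omitted for brevity; the majority-class model at the end exists only to exercise the harness):

    import numpy as np

    def kfold_accuracy(X, y, fit, predict, k=10, seed=0):
        idx = np.random.default_rng(seed).permutation(len(y))
        folds = np.array_split(idx, k)  # k test sets of (nearly) equal size
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[train], y[train])
            scores.append(np.mean(predict(model, X[test]) == y[test]))
        return np.mean(scores)  # average accuracy over the k folds

    X = np.arange(100).reshape(100, 1)
    y = (np.arange(100) % 3 == 0).astype(int)
    fit = lambda X, y: np.bincount(y).argmax()  # majority class
    predict = lambda m, X: np.full(len(X), m)
    print(kfold_accuracy(X, y, fit, predict))   # about 0.66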
Clustering Methods
Many different methods and
algorithms:
For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up
289
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low
across clusters
290
The distance function
Simplest case: one numeric attribute A
Distance(X,Y) = |A(X) – A(Y)|
Several numeric attributes:
Distance(X,Y) = Euclidean distance
between X and Y
Nominal attributes: distance is set to
1 if values are different, 0 if they are
equal
Are all attributes equally important?
291
Simple Clustering: K-means
Works with numeric data only
1) Pick a number (K) of cluster
centers (at random)
2) Assign every item to its nearest
cluster center (e.g. using Euclidean
distance)
3) Move each cluster center to the
mean of its assigned items
4) Repeat steps 2,3 until convergence
(change in cluster
assignments less
292
than a threshold)
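A direct transcription of these four steps (toy two-dimensional data; K = 3 is an illustrative choice):

    import numpy as np

    def kmeans(X, k=3, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]  # 1) random centers
        for _ in range(iters):
            # 2) assign every point to its nearest center (Euclidean)
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # 3) move each center to the mean of its assigned points
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):  # 4) stop once assignments settle
                break
            centers = new
        return centers, labels

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.3, (50, 2)) for m in (0, 3, 6)])
    centers, labels = kmeans(X, k=3)
    print(centers)  # near (0,0), (3,3) and (6,6)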
Data Mining in CRM:
Customer Life Cycle
Customer Life Cycle
The stages in the relationship between a customer
and a business
Key stages in the customer lifecycle
Prospects: people who are not yet customers but
are in the target market
Responders: prospects who show an interest in a
product or service
Active Customers: people who are currently
using the product or service
Former Customers: may be “bad” customers who
did not pay their bills or who incurred high costs
It’s important to know life cycle events (e.g.
retirement)
293
Data Mining in CRM:
Customer Life Cycle
What marketers want: Increasing
customer revenue and customer
profitability
Up-sell
Cross-sell
Keeping the customers for a longer
period of time
Solution: Applying data mining
294
Data Mining in CRM
DM helps to
Determine the behavior surrounding a
particular lifecycle event
Find other people in similar life stages
and determine which customers are
following similar behavior patterns
295
Data Mining in CRM (cont.)
(Diagram: the data warehouse feeds data mining, which produces
customer profiles and customer life cycle information that drive
campaign management.)
296
CRISP-DM: Benefits of a standard
methodology
Communication
A common
language
Repeatability
Rational structure
Education
How do I start?
www.crisp-dm.org
CRISP-DM Overview
An industry-standard
process model for data
mining.
Not sector-specific
CRISP-DM Phases:
Business
Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Not strictly ordered; respects the iterative
aspect of data mining
Non-proprietary
www.crisp-dm.org
Rules vs. decision lists
PRISM with outer loop removed
generates a decision list for one class
Subsequent rules are designed for
instances that are not covered by previous rules
But: order doesn’t matter because all
rules predict the same class
Outer loop considers all classes
separately
No order dependence implied
Problems: overlapping
rules, default rule needed
299
Process Standardization
CRISP-DM:
CRoss Industry Standard Process for Data
Mining
Initiative launched Sept. 1996
SPSS/ISL, NCR, Daimler-Benz, OHRA
Funding from European commission
Over 200 members of the CRISP-DM SIG
worldwide
DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries,
Syllogic, Magnify, ..
System Suppliers / consultants - Cap Gemini, ICL Retail,
Deloitte & Touche, …
End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...
CRISP-DM
Non-proprietary
Application/Industry
neutral
Tool neutral
Focus on business
issues
As well as technical
analysis
Framework for guidance
Experience base
Templates for
Analysis
Why CRISP-DM?
•The data mining process must be reliable and repeatable by
people with little data mining skill
•CRISP-DM provides a uniform framework for
–guidelines
–experience documentation
•CRISP-DM is flexible to account for differences
–Different business/agency problems
–Different data