Data Mining - BYU Data Mining Lab

Data Mining
Process, Key Success Factors, Illustrations
Data Mining in the BI Context
Data Extraction
Data Storage
Business Intelligence
Collecting / Transforming
Storing / Aggregating / Historising
Visualization
Reporting / EIS / MIS
Exploration
OLAP
Data Analysis
Discovery
Data Mining
What Is Data Mining?
Business Definition
• Deployment of business processes, supported
by adequate analytical techniques, to:
• Take further advantage of data
• Discover relevant knowledge
• Act on the results
CRISP-DM
Phases, with tasks and their outputs:
Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report
Data Preparation
• Outputs of the phase: Data Set; Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data
Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision
Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
DOCUMENT EVERYTHING!
Data Mining Tasks
• Summarization
• Classification / Prediction
• Classification, Concept learning, Regression
• Clustering
• Dependency modeling
• Anomaly detection
• Link Analysis
Human Resources
Survey and Online Game
Do They Know Us?
knewELCA vs {Gender, Semester, SchoolType} for EPFL only
[Figure: decision tree for students, split on semester (semester 1 or 2 vs. semester > 2), with leaves NO and YES]
Who Plays?
SchoolType vs {Gender, Year, knewELCA, score} for score > 0 only
[Figure: decision tree predicting SchoolType (EP, FH, HES, UNI) for students; splits on Year (first year, mid-cycle, end of cycle), Gender (male, female), knewELCA, and Score thresholds (18914, 20426, 25008, 31525)]
How Well Do They Do?
Score vs {Gender, Year, SchoolType, knewELCA} for score > 0 only
[Figure: decision tree predicting score class for students; splits on SchoolType (EP, FH, HES, UNI), Year (first year, mid-cycle, end of cycle), Gender (male, female), and knewELCA]
Score classes in the figure legend (score range, count):
• Poor: 0-13136 (21)
• Fair: 13136-19453 (91)
• Good: 19453-25769 (90)
• Excellent: 25769-32086 (39)
• Outstanding: 32086+ (15)
Subscription Retail
Situation & Goal
• Poor understanding of customers and behaviors
• Short audit:
• Nice DWH, only 2 years old, not fully populated
• Limited data on purchases and subscriptions
• Potential goals:
• Associations of products that sell together
• Segmentation of customers
Summarization / Aggregation
• Revenue distribution
• 80% generated by 41.5% of subscribers
• 60% generated by 18.3% of subscribers
• 42.9% generated by top 5 products
• Simple customer classes
• Over 65 years old most profitable
• Under 16 years old least profitable
• Birthdate filled-in for only about 10% of
subscribers!
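A revenue-concentration figure such as "80% generated by 41.5% of subscribers" can be obtained by sorting subscribers by revenue and accumulating from the top. The sketch below is only an illustration: the function name and the revenue values are made up, and it assumes one revenue total per subscriber.

```python
def share_of_customers_for_revenue(revenues, target_share):
    """Smallest fraction of customers, taken from the most valuable down,
    whose combined revenue reaches `target_share` of total revenue."""
    ordered = sorted(revenues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, revenue in enumerate(ordered, start=1):
        running += revenue
        if running >= target_share * total:
            return count / len(ordered)
    return 1.0


# Hypothetical revenue per subscriber (illustrative values, not from the study).
revenues = [500, 210, 100, 80, 50, 30, 15, 5, 5, 5]

share_80 = share_of_customers_for_revenue(revenues, 0.80)
share_60 = share_of_customers_for_revenue(revenues, 0.60)
print(f"80% of revenue comes from {share_80:.0%} of subscribers")
print(f"60% of revenue comes from {share_60:.0%} of subscribers")
```

On real data the same routine would run on revenue aggregated per subscriber from the data warehouse.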
Product Association
• About 21% of subscribers buy P4, P7 and P9
• P4 is the most profitable product
• P7 is ranked 6th
• P9 is ranked 15th with only 2% of revenue
[Figure: product association diagram showing P1 to P9]
• Several possible actions
• Make a bundle offering of these products
• Cross-sell from P9 to P4
• Temptation to remove P9 should be resisted
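A statement like "about 21% of subscribers buy P4, P7 and P9" is the support of an itemset, and the cross-sell idea from P9 to P4 corresponds to the confidence of the rule P9 → P4. The following is a minimal sketch of both measures; the baskets are invented for illustration and the helper names are assumptions, not the tooling used in the study.

```python
def support(baskets, items):
    """Fraction of baskets that contain every product in `items`."""
    wanted = set(items)
    hits = sum(1 for basket in baskets if wanted <= set(basket))
    return hits / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Among baskets containing `antecedent`, the fraction also containing `consequent`."""
    both = support(baskets, set(antecedent) | set(consequent))
    return both / support(baskets, antecedent)


# Hypothetical subscriber baskets (one set of products per subscriber).
baskets = [
    {"P4", "P7", "P9"},
    {"P1", "P4", "P7", "P9"},
    {"P4"},
    {"P2", "P3"},
    {"P7", "P9"},
    {"P4", "P7"},
    {"P5"},
    {"P6", "P8"},
    {"P1", "P2"},
    {"P4", "P5"},
]

bundle_support = support(baskets, {"P4", "P7", "P9"})
p9_to_p4 = confidence(baskets, {"P9"}, {"P4"})
print(f"support(P4, P7, P9) = {bundle_support:.0%}")
print(f"confidence(P9 -> P4) = {p9_to_p4:.0%}")
```

A high confidence for P9 → P4 is one way to express why a low-revenue product can still be worth keeping: it tends to appear alongside the most profitable one.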
Clustering
• One cluster: 30% of customers, who buy a single yearly product!
Summary of Findings
• Data Mining found:
• A small percentage of the customers is responsible for a large
share of the sales
• Several groups of « strongly-connected » articles
• A sizeable group of subscribers who buy a single article
• Lessons learned:
• First 2 findings: « we knew that! » (BUT: scientific
confirmation of business observation)
• 3rd finding: « we could target these customers with a special
offer! »
• Lack of relevant data: the structure is in place but not being
used systematically
Campaign Management
Situation & Goal
Lift
Cumulative Response
30
Lift(c) = CR(c) / c, where CR(c) is the cumulative response: the % of respondents reached when contacting the top c% of prospects as ranked by the model
Example:
Lift(25%) = CR(25%) / 25% = 62% / 25% ≈ 2.5
If we send to 25% of our prospects using the model, they are about 2.5 times as likely to respond as if we were to select them randomly.
[Figure: cumulative response chart; x-axis: % prospects (0 to 100), y-axis: % respondents (0 to 100)]
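The lift definition above can be sketched in a few lines: rank prospects by model score, take the top fraction c, and compare the share of respondents captured with c itself. The scores and response flags below are toy values chosen for illustration, and the function names are assumptions.

```python
def cumulative_response(scores, responded, fraction):
    """Share of all respondents found in the top `fraction` of prospects by score."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    cutoff = round(len(ranked) * fraction)
    captured = sum(flag for _, flag in ranked[:cutoff])
    return captured / sum(responded)


def lift(scores, responded, fraction):
    """Lift(c) = CR(c) / c."""
    return cumulative_response(scores, responded, fraction) / fraction


# Toy data: model score per prospect and whether the prospect responded (1) or not (0).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
responded = [1, 1, 0, 1, 0, 0, 0, 1]

toy_lift = lift(scores, responded, 0.25)
print(f"Lift(25%) on toy data = {toy_lift:.2f}")

# The slide's own numbers: CR(25%) = 62%.
slide_lift = 0.62 / 0.25
print(f"Lift(25%) from the slide = {slide_lift:.2f}")
```

A lift of 1 at c = 100% is a useful sanity check: contacting everyone captures every respondent, so the model offers no advantage over random selection there.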
30,000
Expected ROI
Assume:
200 seminars per year
€0.41 stamp
€200 per seminar
Send half as many, same response
(from 0.1% to 0.2% response rate)
Approach & Cost
[Figure: project flow chart. Phases: AUDIT, APPLICATION IN VITRO, APPLICATION IN VIVO. Steps shown: kick-off; preliminary study; analysis of the existing situation; needs analysis; consolidation; audit report; campaign selection; operational analysis; execution / a posteriori analysis; deployment; models & ROI; decision making; implementation. A decision point is marked "Decision", with a "No" branch flagged "!?!".]
Fixed price: €5,000
Laws of Data Mining
Eight Laws (I)
• Business/domain objectives are the origin of every
data mining solution
• Business/domain knowledge is central to every step
of the data mining process
• Data preparation is more than half of every data
mining process
• The right model for a given application can only be
discovered by experiment
Eight Laws (II)
• There are always patterns
• Data mining amplifies perception in the domain
• The value of data mining results is not determined
by the accuracy or stability of predictive models
• All patterns are subject to change
The Right Expectation
• Data Mining is unlikely to produce surprising results
that will utterly transform a business. Rather:
• Early results: insights about data and scientific
confirmation of human experience/intuition
• Beyond: steady improvement to an already successful
organization
• Occasionally: discovery of one rare/highly valuable
piece of knowledge
The Right Organization
• Data Mining is not sophisticated enough to be
substituted for domain knowledge or for experience
in analysis and model building.
• Rather:
• Data Mining is a joint venture
• “… put teams together that have a variety of skills
(e.g., statistics, business and IT skills), are creative and
are close to the business thinking.”
Key Success Factors
• Have a clearly articulated business problem that
needs to be solved and for which Data Mining is the
adequate technology
• Ensure that the problem being pursued is supported
by the right type of data of sufficient quality and in
sufficient quantity
• Recognize that Data Mining is a process with many
components and dependencies
• Plan to learn from the Data Mining process
whatever the outcome
Essential Tips
Tips (I)
• Don’t wait to get started – the competition is only a
mouse click away
• Begin with the end in mind
• It’s the decision maker, stupid!
• Unless there’s a method, there’s madness
• Better data means better results
Tips (II)
• Twyman’s law: any statistic that appears interesting
is almost certainly a mistake (double-check all
findings)
• Avoid the OLAP trap
• Deployment is the key to data mining ROI
• Champions train so they can win the race
Crawl, Walk, Run
• Exploratory Workshop / Brainstorm
• Identify potential profitable applications
• Data Audit
• Assess data quality and relevance
• Identify shortcomings
• Suggest ways to enrich data (internal and external)
• Domain-relevant Case Studies (start small)
• Refine list of applications to produce well-defined,
actionable, domain-relevant case studies
• Select 1 or more case studies as « pilots »
• Scale-up