Introduction - Qing Li, SWUFE

Business Intelligence Technologies: Data Mining
Lecture 1: Introduction

Agenda

- Course Description
- Course Logistics
- Case discussion
- Introduction to Data Mining

What is covered in this course

Theories/Methods
  - Data mining cycle/process/methodology, evaluation
  - Association rules, decision trees, clustering, nearest neighbor, neural networks, link analysis, Web mining, etc.

Applications
  - Market basket analysis, customer segmentation, CRM, personalization, financial analysis, etc.
  - Business cases

Hands-on Experience
  - SAS Enterprise Miner

Course Objectives

- Understand data mining theories
- Learn popular data mining methods
- Be able to solve specific business problems
- Master a data mining package

Course Logistics

Instructor: Qing Li
  - [email protected]
TA: Jia Wang
  - [email protected]
Office hours:
  - Walk-in
  - By appointment
  - Before and after class
  - Call me

Class Resources

Class homepage: http://liqing.cai.swufe.edu.cn/
  - Slides, announcements, and downloads are posted there

Text Book + Cases + Handouts

Text Book

Michael Berry and Gordon Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, Second Edition, Wiley, 2004, ISBN 0-471-47064-3

Class Schedule

#    Topic
1    Course Overview, Intro to Data Mining
2    Market Basket Analysis & Association Rules, CRM
3    Market Segmentation & Clustering, Prepare data
4    Prediction & Classification: Decision Tree
5    Personalization & Nearest Neighbor
6    Financial Forecasting & Neural Networks
7    Link Analysis & Web mining
8    Misc. Topics
9    Guest Speaker
10   Term project presentations

Group Term Project

Group of 2-3 (3 is better)
  - Due one week from now
Identify a company to study
  - Focus: Data and Business Intelligence
  - Current practice
  - Your recommendations
Two phases
  - Phase 1: Identify the company and give a brief description (due 3 weeks from now)
  - Phase 2: Final report + class presentation

Software

SAS Enterprise Miner
  - Used for homework assignments
  - Needs Windows XP Professional or Mac OS 9
  - I’ll demo SAS in most classes
  - Tutorial available on the course website
  - Every student is encouraged to have a copy in order to follow the class demos
  - Alternative for Vista users: WEKA

Grading

15%  Participation
  - 3: Excellent
  - 2: Good
  - 1: OK
  - 0: Absent with good reason and advance notification
  - -3: Absent with no reason
50%  Homework
  - 2 big assignments
  - Problem solving, data analysis and/or case discussion
  - 25% each
35%  Term Project
  - Phase 1 report: 5%
  - Final report: 20%
  - Class presentation: 5%
  - Peer evaluation: 5%
(No Curve)

Misc. Issues

Slides are available before class
  - Download or print them before class
Lectures may differ from the textbook
  - Some material in the lectures may not be in the book, so please pay attention in class
  - The book is a great reference, not a bible
Finish assigned case readings before each class
Attendance is required

Survey
Case 1: Bank of America

Discussion Questions:
1. What is BoA trying to achieve?
2. What are the alternative solutions? What are the pros and cons of each?
3. What are the stages of data mining? Describe each.
4. What data mining techniques are used, and what are the findings from each technique?

Case 2: A Wireless Company

Discussion Questions:
1. What is the company trying to achieve?
2. How can data mining help?
3. Where did the data come from, and how are the data processed?
4. How is the data mining approach evaluated?

Case 3: SUV

Discussion Questions:
1. What is the company trying to achieve?
2. How can data mining help?
3. What data files are used? What information is contained in these files?
4. How are the two data mining techniques combined, and why is the combination more powerful?

What is data mining?

Informal definition: Finding patterns in data

More formal definition: The non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data

Business Intelligence (one definition): A process for increasing the competitive advantage of a business through intelligent use of available data in decision making

What is a pattern?

Informal definition: Any structure that can be found in the data, e.g.
  - People with good credit ratings have fewer accidents
  - Risk = 0.93*prior_default + 0.23*num_cards - 1.3*employed
  - On Friday nights, male customers who buy diapers also tend to buy beer

Not every pattern is desirable
  - People with high income buy expensive cars

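The Risk formula in the pattern examples is a linear score, and a short sketch can show how such a pattern is applied to individual customers. The coefficients are the ones on the slide. Everything else is an assumption made for illustration: that prior_default and employed are 0/1 indicators, that num_cards is a count, and the two sample customers themselves.

```python
# Coefficients copied from the slide's example pattern.
COEFFICIENTS = {"prior_default": 0.93, "num_cards": 0.23, "employed": -1.3}


def risk_score(customer):
    """Weighted sum of the customer's attributes (a linear pattern)."""
    return sum(weight * customer[name] for name, weight in COEFFICIENTS.items())


# Hypothetical customers, invented for illustration only.
# Assumed encoding: prior_default and employed are 0/1, num_cards is a count.
alice = {"prior_default": 0, "num_cards": 2, "employed": 1}
bob = {"prior_default": 1, "num_cards": 5, "employed": 0}

for name, customer in (("alice", alice), ("bob", bob)):
    print(name, round(risk_score(customer), 2))
```

Under these assumptions, a prior default and more cards raise the score while being employed lowers it, so the unemployed customer with a prior default scores higher.
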
Why Data Mining?

Because data mining affects virtually every data-intensive industry

Marketing
  - Which customers are likely to respond to this campaign?
  - What other products or services should be offered to a customer? (cross-selling)
  - What types of customers are loyal?
Telecommunications
  - Which customers will switch to competitors?
  - Which calls are fraudulent?
Retail
  - Which products do customers buy together (or in sequence)?
Healthcare
  - Which patients may take longer to recover?
  - What is the likely cause of an illness?
Finance and Insurance
  - What types of customers have high credit risks / insurance risks?
  - What interest rate or insurance premium should be given to different customers?
  - Which stocks are likely to perform well in the next 3 months?
Customer Support
  - Which customer service representative should be assigned to a task?
  - When a customer calls, the customer representative’s screen shows exactly where to lead the conversation.

Wherever there is data, there is and should be data mining!

Why Data Mining? Some Real Examples

Safeway:
  - Shopper cards capture point-of-sale data and personal information
  - Arrange products on shelves: beer & diapers
  - Sell names to suppliers so that manufacturer coupons can be targeted
Pfizer pharmaceuticals:
  - Construct a predictive model which tells patients their cholesterol risk score
  - High-risk patients can request Lipitor, Pfizer’s cholesterol medication
Amazon:
  - Recommendations
Fidelity:
  - Cross-selling: when a customer calls, know what other services to offer
  - Build models to figure out what makes a loyal customer
  - These models saved a marginally profitable bill-paying service
Capital One:
  - What terms should be offered to different customers?
  - The lowest loan loss rates in the industry

Why Data Mining Now?

Three factors converging on DM (diagram):
  - Better and cheaper computing power
  - Mature data mining technology
  - Improved data collection & storage

Plus:
  - Data is being produced at a tremendous speed
  - Competitive pressures are enormous

Descriptive vs. Predictive Data Mining

Descriptive DM is used to learn about and understand the data.
  - What items are purchased together?
  - Identify and describe groups of customers with common buying behavior
Predictive DM aims to build models in order to predict unknown values of interest.
  - A model that, given a customer’s characteristics, predicts how much the customer will spend on the next catalog order
  - Predicting which customers are likely to leave
  - Which direction is Stock X going to move tomorrow?
Most predictive models are also descriptive

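The descriptive question "What items are purchased together?" can be answered with simple counting. The sketch below is illustrative only: the baskets are invented, and the helper names pair_counts and share_also_buying are hypothetical, not part of any package used in the course. The second helper hints at why descriptive and predictive overlap: the share of diaper baskets that also contain beer describes the data, and it also gives a rough guess about the next diaper buyer.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"diapers", "milk"},
    {"beer", "chips"},
]


def pair_counts(baskets):
    """Descriptive: count how often each pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def share_also_buying(baskets, given, target):
    """Share of baskets containing `given` that also contain `target`."""
    with_given = [b for b in baskets if given in b]
    if not with_given:
        return 0.0
    return sum(target in b for b in with_given) / len(with_given)


counts = pair_counts(baskets)
share = share_also_buying(baskets, "diapers", "beer")
print(counts.most_common(3))
print(f"{share:.2f} of diaper baskets also contain beer")
```
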
Data Mining Software

Big names:
  - IBM Intelligent Miner
  - SPSS Clementine
  - Microsoft SQL Server 2000 Analysis Services
  - Oracle 9i Data Mining
  - SAS Enterprise Miner
Smaller companies:
  - ANGOSS KnowledgeStudio
  - XLMiner
  - Megaputer PolyAnalyst
  - DBMiner
Free or open source:
  - Weka
  - Lots of free programs on the Internet supporting individual data mining techniques
A good portal for data mining resources:
  - http://www.kdnuggets.com

Virtuous Cycle of Data Mining

Finding patterns is not enough
Must respond to the patterns by taking action
Turning:
  - Data into Information
  - Information into Action
  - Action into Value

The cycle (diagram):
1, Identify the business problem
2, Mine the data to transform it into actionable information
3, Act on the information
4, Measure the results

1, Identify the Business Opportunity

Many business processes are good candidates:
  - New product introduction
  - Direct marketing campaign
  - Understanding customer attrition/churn
  - Evaluating the results of a test market
Or more specific problems:
  - What types of customers responded to our last campaign?
  - Where do the best customers live?
  - Are long waits in check-out lines a cause of customer attrition?
  - What products should be promoted with our XYZ product?

TIP: When talking with business users about data mining opportunities, make sure you focus on the business problems/opportunities and not on technology and algorithms.

Another goal of this course is for you to think strategically about what business opportunities can be addressed by data mining techniques.

2, Mining the Data to Transform it into Actionable Information

Success is making business sense of the data
Need to figure out the specific data mining tasks used to address the business opportunities identified in the first step
Deal with messy data
  - Don’t expect clean data. Data cleaning accounts for 70% of the effort
Implementation problems:
  - What techniques to use?
  - How to use the techniques?
  - Selecting the right model
Other problems:
  - Data privacy issues

3, Take Action

Taking action is the whole purpose of data mining
Now, with the patterns discovered from mining the data, we can make better-informed decisions
Examples
  - Contacting targeted customers
  - Prioritizing customer service
      - Cingular and AT&T were fined $1.5 million on Sept. 10, 2004 for differentiating their service based on customers’ credit ratings
  - Adjusting inventory levels
  - Rearranging products on the shelves
  - Verizon sends out 40k mailings to selected customers per month

4, Measuring Results

Assess the impact of the action taken
Often overlooked, ignored, skipped
Planning for the measurement should begin when analyzing the business opportunity, not after it is “all over”
Assessment questions (examples):
  - Did this campaign do what we hoped?
  - Did some offers work better than others?
  - Did it lower cost or increase profit?
  - Tons of others…

Data Mining General Guidelines

- The DM virtuous cycle (4 steps) is iterative
- No steps should be skipped
- Common sense prevails with respect to how rigorously each step is carried out
- The 4 steps of the virtuous cycle expand into a more rigorous 11-step methodology

Detailed Data Mining Process: 11 Steps

1, Translate the business problem into a data mining problem
2, Select appropriate data
3, Get to know the data
4, Create a model set
5, Fix problems with the data
6, Transform data to bring information to the surface
7, Build models
8, Assess models
9, Deploy models
10, Assess results
11, Begin again

Step 1: Transforming Business Problems into DM Tasks

Business problems can often be big and vague
Data mining tasks need to be more concrete
Sample business problems:
  - How to improve response to a direct marketing campaign?
  - Which ads to place on web pages in order to improve the click-through rate?
How to transform these into DM tasks?

Step 2-6: Data Preparation

Get data
  - Different (heterogeneous) sources
  - Need to collect additional data?
  - Credit card charge records, point-of-sale data, web logs, etc.
Clean/correct data
  - Correct errors
  - Add missing values
  - Discard garbage, remove outliers
Transform data if needed
  - Derived attributes: bring information to the surface
  - Income -> income bracket, when the model requires categorical data
  - DOB -> age

Step 7-9: Model Building
Choice of model, model building and model assessment

Decide what model type to use
  - Descriptive or predictive model?
  - Which specific technique?
  - Often can try different techniques
Things to consider:
  - Computational issues
  - Implementation issues
  - Availability and amount of relevant data
  - Do we have the necessary expertise?
Assess models
  - Accuracy on testing data
  - Small is beautiful
  - Easier to understand
Step 9 is more about scoring or ranking real data

Step 10: Assess the Result

It’s not model accuracy any more
It’s more about achieving the business goal
It’s closely related to business decisions
  - E.g. if it’s more expensive to deploy a data mining model, a mass mailing may be more cost-effective than a targeted one
But it’s often hard to isolate the effect of a solution. Indirect benefits may be hard to see.
  - Do a market test

Common Data Mining Mistakes

Learn things that aren’t true
  - Patterns may not represent any underlying rule
      - Tall candidates win presidential elections
      - True in the data, but has no predictive power
  - The data may not reflect the relevant population
      - The sample should not be biased
      - Otherwise, the result cannot be generalized
      - E.g. your existing customers are not like the customers you want to acquire
  - Data may be at the wrong level of detail
      - Refer to Simpson’s paradox (next slide)
Learn things that are true, but not useful
  - Things that are already known
      - The majority of rules learned are normal business rules
      - E.g. retired employees don’t respond to retirement plan promotions
  - Things that can’t be used (AT&T/Cingular example)
      - Inability to act upon patterns because of political, legal and ethical reasons

Simpson’s Paradox

Business and Law Schools (combined)
          Admit       Deny        Total
Male      490 (70%)   210 (30%)   700
Female    280 (56%)   220 (44%)   500

Business School
          Admit       Deny        Total
Male      480 (80%)   120 (20%)   600
Female    180 (90%)   20 (10%)    200

Law School
          Admit       Deny        Total
Male      10 (10%)    90 (90%)    100
Female    100 (33%)   200 (67%)   300

Simpson’s Paradox refers to the reversal of the direction of a comparison or an association when data from several groups are combined to form a single group.
Here women have the higher admission rate in each school but the lower rate overall. The two schools admit at very different rates, and men and women apply to them in different proportions, so the two tables really shouldn’t be combined.

What to Do After Class

- Read Chapters 1, 2, and 3
- Read the cases for Lecture 2
- Install SAS
- Find group members for your term project and start thinking about which company to select