Datamining in e-Business: Veni, Vidi, Vici!
Prof. Dr. Veljko Milutinovic:
•Coarchitect of the World's first 200MHz RISC microprocessor,
for DARPA, about a decade before Intel.
•Responsible for several successful datamining-oriented e-business products on the Internet,
developed in cooperation with leading industry in the USA and Europe.
•Consulted for a number of high-tech companies
(TechnologyConnect, BioPop, IBM, AT&T, NCR, RCA, Honeywell, Fairchild, etc...)
•Ph.D. from Belgrade. After that, for about a decade, in various positions (professor)
at one of the top 5 (out of about 2000) US universities in computer engineering (Purdue).
•Author and coauthor of about 50 IEEE journal papers (plus many more in other journals).
According to some, a European record for his research field.
•Guest editor for a number of special issues of: Proceedings of the IEEE, IEEE
Transactions on Computers, IEEE Concurrency, IEEE Computer, etc…
•Over 20 books published by the leading USA publishers
(Wiley, Prentice-Hall, North-Holland, Kluwer, IEEE CS Press, etc...).
•Forewords for 7 of his books written by 7 different Nobel Laureates, in cooperation with
Telecom Italia learning services (you are welcome to visit http://www.ssgrr.it).
[email protected]
http://galeb.etf.bg.ac.yu/~vm
Page Number: 1
Issues in Data Mining
Infrastructure
Authors:
Nemanja Jovanovic, [email protected]
Valentina Milenkovic, [email protected]
Prof. Dr. Veljko Milutinovic, [email protected]
http://galeb.etf.bg.ac.yu/~vm
Page Number: 2
Data Mining in a Nutshell

Uncovering the hidden knowledge

Huge NP-complete search space

Multidimensional interface
NOTICE:
All trademarks and service marks mentioned in this document are marks of their
respective owners. Furthermore, the CRISP-DM consortium (NCR Systems Engineering
Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA)
and OHRA Verzekeringen en Bank Groep B.V (The Netherlands)) permitted
presentation of its process model.
Page Number: 3
A Problem …
You are a marketing manager
for a cellular phone company

Problem: Churn is too high

Turnover (after contract expires) is 40%

Customers receive a free phone (cost $125)

You pay a sales commission of $250 per contract

Giving a new telephone to everyone
whose contract is expiring is expensive

Bringing back a customer after quitting
is both difficult and expensive
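The figures on this slide can be turned into a rough cost comparison. The sketch below is only an illustration: the batch of 1,000 expiring contracts, the assumption that replacing a lost customer costs one free phone plus one sales commission, and the function names are all made up for this example.

```python
PHONE_COST = 125   # dollars, from the slide
COMMISSION = 250   # dollars per contract, from the slide
CHURN_RATE = 0.40  # turnover after contract expiry, from the slide


def blanket_offer_cost(n_customers):
    """Cost of giving a new phone to every customer whose contract expires."""
    return n_customers * PHONE_COST


def replacement_cost(n_customers, churn_rate=CHURN_RATE):
    """Cost of signing up one new customer for each customer who leaves.

    Assumption: a replacement customer costs one free phone plus one commission.
    """
    churners = round(n_customers * churn_rate)
    return churners * (PHONE_COST + COMMISSION)


if __name__ == "__main__":
    n = 1000  # hypothetical number of expiring contracts
    print("Phones for everyone:", blanket_offer_cost(n))
    print("Replacing churners: ", replacement_cost(n))
```

Under these assumptions, 1,000 expiring contracts cost $125,000 in phones if everyone gets one, and $150,000 in replacements if nobody does. Both options are expensive, which is why predicting who will actually leave is worth the effort.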
Page Number: 4
… A Solution

Three months before a contract expires,
predict which customers will leave

If you want to keep a customer
that is predicted to churn,
offer them a new phone

The ones that are not predicted to churn
need no attention

If you don’t want to keep the customer, do nothing

How can you predict future behavior?

Tarot Cards?

Magic Ball?

Data Mining?
Page Number: 5
Still Skeptical?
Page Number: 6
The Definition
The automated extraction
of predictive information
from (large) databases

Automated

Extraction

Predictive

Databases
Page Number: 7
History of Data Mining
Page Number: 8
Repetition in Solar Activity

1613 – Galileo Galilei

1859 – Heinrich Schwabe
Page Number: 9
The Return of
Halley's Comet
Edmund Halley (1656 - 1742)
239 BC
1531
1607
1682
1910
1986
2061 ???
Page Number: 10
Data Mining is Not

Data warehousing

Ad-hoc query/reporting

Online Analytical Processing (OLAP)

Data visualization
Page Number: 11
Data Mining is

Automated extraction
of predictive information
from various data sources

Powerful technology
with great potential to help users focus
on the most important information
stored in data warehouses
or streamed through communication lines
Page Number: 12
Data Mining can

Answer questions
that were too time-consuming
to resolve in the past

Predict future trends and behaviors,
allowing us to make proactive,
knowledge-driven decisions
Page Number: 13
Focus of this Presentation

Data Mining problem types

Data Mining models and algorithms

Efficient Data Mining

Available software
Page Number: 14
Data Mining
Problem Types
Page Number: 15
Data Mining Problem Types

6 types

Often a combination solves the problem
Page Number: 16
Data Description and
Summarization

Aims at concise description
of data characteristics

Lower end of scale of problem types

Provides the user an overview
of the data structure

Typically a subgoal
Page Number: 17
Segmentation

Separates the data into
interesting and meaningful
subgroups or classes

Manual or (semi)automatic

A problem in itself
or just a step
in solving a problem
Page Number: 18
Classification

Assumption: existence of objects
with characteristics that
belong to different classes

Building classification models
which assign correct class labels to previously unseen objects

Arises in a wide range of applications

Segmentation can provide labels
or restrict data sets
Page Number: 19
Concept Description

Understandable description
of concepts or classes

Close connection to both
segmentation and classification

Similarity and differences
to classification
Page Number: 20
Prediction (Regression)

Finds the numerical value
of the target attribute
for unseen objects

Similar to classification - difference:
discrete becomes continuous
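A minimal sketch of prediction with a continuous target, assuming the simplest possible model: a straight line fitted by ordinary least squares to one input attribute. The function names and the sample numbers in the usage note are hypothetical; real tools use far richer models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (unnormalised; the factor 1/n cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_value(x, slope, intercept):
    """Numerical value of the target attribute for an unseen object."""
    return slope * x + intercept
```

For example, fitting the points (1, 3), (2, 5), (3, 7), (4, 9) gives a slope of 2 and an intercept of 1, so the predicted value for the unseen input 5 is 11. A classifier would return a discrete label here instead of a number.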
Page Number: 21
Dependency Analysis

Finding the model
that describes significant dependences
between data items or events

Prediction of value of a data item

Special case: associations
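For the special case of associations, dependencies are usually scored with support and confidence. The sketch below assumes that each transaction is a set of item names; the basket contents in the usage note are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base
```

With the hypothetical baskets {phone, case}, {phone, case, charger}, {phone} and {charger}, the rule "phone implies case" has a support of 0.5 and a confidence of 2/3.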
Page Number: 22
Data Mining Models
Page Number: 23
Neural Networks

Characterizes processed data
with single numeric value

Efficient modeling of
large and complex problems

Based on biological structures
- Neurons

Network consists of neurons
grouped into layers
Page Number: 24
Neuron Functionality
Inputs: I1, I2, I3, …, In
Weights: W1, W2, W3, …, Wn
Output = f (W1*I1, W2*I2, …, Wn*In)
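A minimal sketch of a single neuron, assuming that f first sums the weighted inputs and then applies a sigmoid activation. The slide does not say which f is used, so both choices (and the function names) are illustrative only.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, activation=sigmoid):
    """Apply the activation to the sum of the weighted inputs Wk * Ik."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * i for w, i in zip(weights, inputs))
    return activation(weighted_sum)
```

Passing a different `activation` (for example the identity function) changes the behaviour of the neuron without touching the weighted sum, which is the part that training adjusts.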
Page Number: 25
Training Neural Networks
Page Number: 26
Neural Networks

Once trained, Neural Networks
can efficiently estimate the value
of an output variable for given input

Neurons and network topology
are essential

Usually used for prediction
or regression problem types

Difficult to understand

Data pre-processing often required
Page Number: 27
Decision Trees

A way of representing a series of rules
that lead to a class or value

Iterative splitting of data
into discrete groups
maximizing distance between them
at each split

Classification trees and regression trees

Univariate splits and multivariate splits

Unlimited growth and stopping rules

CHAID, CART, Quest, C5.0
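One splitting step can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular product: it handles a single numeric attribute (a univariate split) and measures the quality of a split with Gini impurity, a criterion commonly associated with CART. The function names and sample data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split.

    The threshold is None if no split improves on the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

A full tree repeats this step on each resulting group until a stopping rule applies, for example a minimum group size or a pure group.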
Page Number: 28
Decision Trees
[Figure: example decision tree with the splits Balance>10 / Balance<=10, Age<=32 / Age>32, and Married=NO / Married=YES]
Page Number: 29
Decision Trees
Page Number: 30
Rule Induction

Method of deriving a set of rules
to classify cases

Creates independent rules
that are unlikely to form a tree

Rules may not cover
all possible situations

Rules may sometimes
conflict in a prediction
Page Number: 31
Rule Induction
If balance>100.000
then confidence=HIGH & weight=1.7
If balance>25.000 and
status=married
then confidence=HIGH & weight=2.3
If balance<40.000
then confidence=LOW & weight=1.9
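The three rules above can be applied as in the sketch below. The conditions, outcomes and weights come from the slide (with 100.000 read as one hundred thousand); resolving conflicts by adding up the weights of all rules that fire is an assumption made for this example, as are the field names `balance` and `status`.

```python
# Each rule: (condition, predicted confidence, weight), taken from the slide.
RULES = [
    (lambda c: c["balance"] > 100_000, "HIGH", 1.7),
    (lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
    (lambda c: c["balance"] < 40_000, "LOW", 1.9),
]


def predict_confidence(customer):
    """Return the outcome with the largest total weight, or None if no rule fires."""
    votes = {}
    for condition, outcome, weight in RULES:
        if condition(customer):
            votes[outcome] = votes.get(outcome, 0.0) + weight
    if not votes:
        return None  # the rules do not cover this situation
    return max(votes, key=votes.get)
```

A married customer with a balance of 30.000 triggers both the second rule (HIGH, weight 2.3) and the third rule (LOW, weight 1.9), so the rules conflict and the weights decide in favour of HIGH. A single customer with a balance of 50.000 triggers no rule at all.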
Page Number: 32
K-nearest Neighbor and
Memory-Based Reasoning (MBR)

Usage of knowledge
of previously solved similar problems
in solving the new problem

Assigning a new case to the class
to which most of its k “neighbors” belong

First step – finding the suitable measure
for distance between attributes in the data

+ Easy handling of non-standard data types

- Huge models
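A minimal k-nearest-neighbor sketch, assuming purely numeric attributes and Euclidean distance as the distance measure. Both are illustrative choices; picking a suitable measure for the data at hand is the real first step. The function name and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, examples, k=3):
    """Classify `query` by majority vote among its k nearest stored examples.

    `examples` is a list of (attribute_tuple, class_label) pairs.
    """
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    solved_cases = [
        ((1, 1), "stay"), ((1, 2), "stay"), ((2, 1), "stay"),
        ((8, 8), "churn"), ((9, 8), "churn"), ((8, 9), "churn"),
    ]
    print(knn_classify((2, 2), solved_cases))
```

The "model" is simply the full list of previously solved cases, which is why such models become huge.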
Page Number: 33
K-nearest Neighbor and
Memory-Based Reasoning (MBR)
Page Number: 34
Data Mining Models
and Algorithms

Many other available models and algorithms

Logistic regression

Discriminant analysis

Generalized Additive Models (GAM)

Genetic algorithms

Etc…

Many application specific variations
of known models

Final implementation usually involves
several techniques

Selection of the solution that gives the best results
Page Number: 35
Efficient Data Mining
Page Number: 36
[Flowchart: Is It Working? / Don’t Mess With It! / Did You Mess With It? / You Shouldn’t Have! / Anyone Else Knows? / Hide It / You’re in TROUBLE! / Can You Blame Someone Else? / Will it Explode In Your Hands? / Look The Other Way / NO PROBLEM!]
Page Number: 37
DM Process Model

5A – used by SPSS Clementine
(Assess, Access, Analyze, Act and Automate)

SEMMA – used by SAS Enterprise Miner
(Sample, Explore, Modify, Model and Assess)

CRISP-DM – on its way to becoming a standard
Page Number: 38
CRISP - DM

CRoss-Industry Standard Process for DM

Conceived in 1996 by three companies:
Page Number: 39
CRISP – DM methodology
Four level breakdown of the CRISP-DM methodology:
Phases
Generic Tasks
Specialized Tasks
Process Instances
Page Number: 40
Mapping generic models
to specialized models

Analyze the specific context

Remove any details not applicable to the context

Add any details specific to the context

Specialize generic contents according to
concrete characteristics of the context

Possibly rename generic contents
to provide more explicit meanings
Page Number: 41
Generalized and Specialized
Cooking

Generalized:
Preparing food on your own
Find out what you want to eat
Find the recipe for that meal
Gather the ingredients
Prepare the meal
Enjoy your food
Clean up everything (or leave it for later)

Specialized:
Raw steak with vegetables?
Check the Cookbook or call mom
Defrost the meat (if you had it in the fridge)
Buy missing ingredients or borrow from the neighbors
Cook the vegetables and fry the meat
Enjoy your food or even more
You were cooking, so convince someone else to do the dishes
Page Number: 42
CRISP – DM model

Business understanding

Data understanding

Data preparation

Modeling

Evaluation

Deployment
Page Number: 43
Business Understanding

Determine business objectives

Assess situation

Determine data mining goals

Produce project plan
Page Number: 44
Data Understanding

Collect initial data

Describe data

Explore data

Verify data quality
Page Number: 45
Data Preparation

Select data

Clean data

Construct data

Integrate data

Format data
Page Number: 46
Modeling

Select modeling technique

Generate test design

Build model

Assess model
Page Number: 47
Evaluation
results = models + findings

Evaluate results

Review process

Determine next steps
Page Number: 48
Deployment

Plan deployment

Plan monitoring and maintenance

Produce final report

Review project
Page Number: 49
At Last…
Page Number: 50
WWW.NBA.COM
Page Number: 51
Se7en
Page Number: 52
 CD – ROM 
Page Number: 53
Evolution of Data Mining
Evolutionary Step: Data Collection (1960s)
Business Question: What was my average total revenue over the last 5 years?
Enabling Technologies: Computers, tapes, disks
Product Providers: IBM, CDC
Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
Business Question: What were unit sales in New England last March?
Enabling Technologies: RDBMS, SQL, ODBC
Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Navigation (1990s)
Business Question: What were unit sales in New England last March? Drill down to Boston.
Enabling Technologies: OLAP, multidimensional databases, data warehouses
Product Providers: Pilot, IRI, Arbor, Redbrick, Evolutionary Technologies
Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (2000)
Business Question: What’s likely to happen to Boston unit sales next month? Why?
Enabling Technologies: Advanced algorithms, multiprocessors, massive databases
Product Providers: Lockheed, IBM, SGI, numerous startups
Characteristics: Prospective, proactive information delivery
Page Number: 54
Examples of DM projects to stimulate your imagination

Here are six examples of how data mining is helping corporations
to operate more efficiently and profitably in today's business environment
– Targeting a set of consumers
who are most likely to respond to a direct mail campaign
– Predicting the probability of default for consumer loan applications
– Reducing fabrication flaws in VLSI chips
– Predicting audience share for television programs
– Predicting the probability that a cancer patient
will respond to radiation therapy
– Predicting the probability that an offshore oil well is actually going
to produce oil
Page Number: 55
Comparison of fourteen DM tools





Evaluated by four undergraduates inexperienced at data mining,
a relatively experienced graduate student, and
a professional data mining consultant

Run under MS Windows 95, MS Windows NT,
or Macintosh System 7.5

Use one of four technologies:
Decision Trees, Rule Induction, Neural Networks, or Polynomial Networks

Solve four problems:
two binary classification problems, a multi-class classification problem,
and a noiseless estimation problem

Prices range from $75 to $25,000
Page Number: 56
Comparison of fourteen DM tools




The Decision Tree products were
- CART
- Scenario
- See5
- S-Plus
The Rule Induction tools were
- WizWhy
- DataMind
- DMSK
Neural Networks were built from three programs
- NeuroShell2
- PcOLPARS
- PRW
The Polynomial Network tools were
- ModelQuest Expert
- Gnosis
- a module of NeuroShell2
- KnowledgeMiner
Page Number: 57
Criteria for evaluating DM tools
A list of 20 criteria for evaluating DM tools, put into 4 categories:

Capability measures what a desktop tool can do,
and how well it does it
- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Includes quality of testing options
- Has a programming language
- Provides useful output reports
- Provides visualisation
Page Number: 58
[Table: capability ratings of the tools, including Visualisation]
Legend: + excellent capability, ✓ good capability, - some capability, “blank” no capability
Page Number: 59
Criteria for evaluating DM tools

Learnability/Usability shows how easy a tool is to learn and use
- Tutorials
- Wizards
- Easy to learn
- User’s manual
- Online help
- Interface
Page Number: 60
Criteria for evaluating DM tools

Interoperability shows a tool’s ability to interface
with other computer applications
- Importing data
- Exporting data
- Links to other applications

Flexibility
- Model adjustment flexibility
- Customizable work environment
- Ability to write or change code
Page Number: 61
Data Input & Output Model
+ excellent capability
✓ good capability
- some capability
“blank” no capability
Page Number: 62
A classification of data sets

Pima Indians Diabetes data set
– 768 cases of Native American women from the Pima tribe,
some of whom are diabetic, most of whom are not
– 8 attributes plus the binary class variable for diabetes per instance

Wisconsin Breast Cancer data set
– 699 instances of breast tumors, some of which are malignant,
most of which are benign
– 10 attributes plus the binary malignancy variable per case

The Forensic Glass Identification data set
– 214 instances of glass collected during crime investigations
– 10 attributes plus the multi-class output variable per instance

Moon Cannon data set
– 300 solutions to the equation:
$x = 2v^2 \sin(\theta)\cos(\theta)/g$
– the data were generated without adding noise
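A noiseless data set of this kind could be generated as sketched below, reading the Moon Cannon equation as the standard projectile range formula (launch speed v, elevation angle, gravitational acceleration g). The lunar gravity constant, the sampling ranges for speed and angle, and the function names are assumptions for illustration; the original study's settings are not given on the slide.

```python
import math
import random

G_MOON = 1.62  # m/s^2, approximate lunar surface gravity (assumed here)


def cannon_range(v, theta, g=G_MOON):
    """Horizontal distance x = 2 v^2 sin(theta) cos(theta) / g."""
    return 2.0 * v ** 2 * math.sin(theta) * math.cos(theta) / g


def generate_cases(n=300, seed=0):
    """Return n noiseless (v, theta, x) triples."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)           # hypothetical speed range, m/s
        theta = rng.uniform(0.0, math.pi / 2)  # hypothetical angle range, radians
        cases.append((v, theta, cannon_range(v, theta)))
    return cases
```

Because no noise is added, a tool that models the data perfectly can reach zero error, which makes this a pure estimation benchmark.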
Page Number: 63
Evaluation of fourteen DM tools
Page Number: 64
Strengths and Weaknesses
Strengths:
Ease of use
(Scenario, WizWhy..)
Data visualisation
(S-plus, MineSet...)
Depth of algorithms (tree options)
(CART, See5, S-plus..)
Multiple nn architectures
(NeuroShell)

Weaknesses:
Difficult file I/O
(OLPARS, CART)
Limited visualisation
(PRW, See5, WizWhy)
Narrow analysis path
(Scenario)
Page Number: 65
How to improve existing DM applications
The top ten points:
• Database integration
– no more flat files
– use the millions of dollars spent on data warehousing
• Automated model scoring
– without scoring, DM is pretty useless
– should be integrated with the driving applications
• Exporting models to other applications
– close the loop between DM and applications
that need to use the results (scores)
Page Number: 66
How to improve existing DM applications

• Business templates
– a cross-selling-specific application is more valuable
than a general modeling tool
• Effort knob
– it is relevant in a way that tuning parameters are not
• Incorporate financial information
– financial information is very important and often available,
and should be provided as input to the DM application
Page Number: 67
How to improve existing DM applications

• Computed target columns
– allow the user to interactively create a new target variable
• Time-series data
– a year’s worth of monthly balance information is qualitatively
different from twelve distinct non-time-series variables
• Use versus View
– do not present the full model visually to the user,
only the most important levels
• Wizards
– not necessary but desirable
– prevent human error by keeping the user on track
Page Number: 68
Potential Applications
Data mining has many varied fields of application,
some of which are listed below:

Retail/Marketing

Identify buying patterns from customers

Find associations among customer demographic characteristics

Predict response to mailing campaigns

Market basket analysis
Page Number: 69
Potential Applications
• Banking

Detect patterns of fraudulent credit card use

Identify `loyal' customers

Determine credit card spending by customer groups

Find hidden correlations between different financial indicators

Identify stock trading rules from historical market data
Page Number: 70
Potential Applications
• Insurance and Health Care

Claims analysis - i.e., which medical procedures are claimed together

Predict which customers will buy new policies

Identify behaviour patterns of risky customers

Identify fraudulent behaviour
Page Number: 71
Potential Applications
• Transportation

Determine the distribution schedules among outlets

Analyse loading patterns
• Medicine

Characterise patient behaviour to predict office visits

Identify successful medical therapies for different illnesses

Predict the effectiveness of surgical procedures or
medical tests
Page Number: 72
Potential Applications
• Sport

Make the best choice of players in different circumstances

Predict the results of relevant matches

Produce a better seeding of players in groups or tournaments
• DM report from an NBA game
When Price was Point-Guard,
J.Williams missed 0% (0) of his jump field-goal-attempts,
and made 100% (4) of his jump field-goal-attempts.
The total number of such field-goal-attempts was 4.
Page Number: 73
DM and Customer Relationship Management

CRM is a process that manages the interactions
between a company and its customers
• Users of CRM software applications are database marketers
• Goals of database marketers are:
– identifying market segments, which requires significant data
about prospective customers and their buying behaviors
– building and executing campaigns

Tightly integrating the two disciplines presents an opportunity
for companies to gain competitive advantage
Page Number: 74
DM and Customer Relationship Management





How Data Mining helps Database Marketing
Scoring
The role of Campaign Management Software
Increasing the customer lifetime value
Combining Data Mining and Campaign Management
Page Number: 75
DM and Customer Relationship Management

Evaluating the benefits of a Data Mining model
Gains chart
Profitability chart
Page Number: 76
Data Mining Examples

Bass Brewers
“We’ve been brewing beer since 1777. With increased competition
comes a demand to make faster/better decisions”
• Northern Bank
“The information is now more accessible, paperless, and timely.”
• TSB Group Plc
“We are using Holos because of its flexibility and its excellent
multidimensional database”
Page Number: 77
Data Mining Examples

Delphic University
“Real value is added to data by multidimensional manipulation
(being able to easily compare many different views
of the available information in one report) and by modeling.”
• Harvard - Holden
“Sybase technology has allowed us to develop an information
system that will preserve this legacy into the twenty-first century.”
• J.P.Morgan
“The promise of data mining tools like Information Harvester is
that they are able to quickly wade through massive amounts
of data to identify relationships or trending information
that would not have been available without the tool.”
Page Number: 78
Securities Brokerage Case Study

The following four pages are derived
from a copyrighted case study
originally created by SmartDrill Data Mining
(Marlborough, MA, U.S.A.).
• Their website is:
http://smartdrill.com
• And the original case study appears in its entirety here:
http://smartdrill.com/CHAID.html
Page Number: 79
Securities Brokerage Case Study

Predictive market segmentation model designed to identify
and profile high-value brokerage customer segments
as targets for special marketing communications efforts.
• The dependent variable for this ordinal CHAID model
is brokerage account commission dollars during the past 12 months
• We begin by splitting the client's entire customer file
into a modeling sample and a validation sample.
(Once the model is built using the modeling sample,
we apply it to the validation sample to see how well it works
on a sample other than the one on which it was built).
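Splitting a customer file into a modeling sample and a validation sample can be sketched as below. The 50/50 default, the fixed random seed and the function name are assumptions for illustration; the case study does not state how its split was made.

```python
import random


def split_sample(records, validation_fraction=0.5, seed=42):
    """Randomly split records into (modeling_sample, validation_sample)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]
```

The model is then built only on the modeling sample, and its segment averages are recomputed on the validation sample to check that they hold up on data the model has not seen.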
Page Number: 80
Securities Brokerage Case Study

The resulting CHAID model has 55 segments.
• However, the results are summarized in the following comb chart,
showing the segment indexes (indexes of average dollar value)
Page Number: 81
Securities Brokerage Case Study
Part of the Gains Chart: Average Annual Brokerage Commission Dollars
Gains chart provides
quantitative detail useful
for financial and marketing
planning.

We have highlighted the
top 20% of the file in blue

The top 20% of the file
is worth an average
of about $334 per account,
which is nearly three times
the average account value
for the entire sample.
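How the rows of a gains chart are computed can be sketched as follows, assuming each segment is described by its number of accounts and its average dollar value. The function name is hypothetical, and the figures in the usage note are invented; they are not the numbers of this case study.

```python
def gains_table(segments):
    """Build cumulative gains rows from (n_accounts, average_value) segments.

    Segments are ranked from best to worst. Each row holds the cumulative
    percentage of the file, the cumulative average value per account, and
    an index where 100 means the average of the entire sample.
    """
    ranked = sorted(segments, key=lambda s: s[1], reverse=True)
    total_accounts = sum(n for n, _ in ranked)
    overall_avg = sum(n * avg for n, avg in ranked) / total_accounts
    rows = []
    cum_accounts = 0
    cum_value = 0.0
    for n, avg in ranked:
        cum_accounts += n
        cum_value += n * avg
        cum_avg = cum_value / cum_accounts
        rows.append((
            100.0 * cum_accounts / total_accounts,
            cum_avg,
            100.0 * cum_avg / overall_avg,
        ))
    return rows
```

With two made-up segments, 200 accounts averaging $300 and 800 accounts averaging $50, the first row covers the top 20% of the file at $300 per account with an index of 300, that is, three times the overall average of $100.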

Page Number: 82
Securities Brokerage Case Study

Using the data in the gains chart,
we can better plan our communications/promotion budget.
• In general, the best segments represent customers who are
experienced, aggressive, self-directed traders.
• Other decisions that can help us:
– We might wish to conduct some market research among customers
in under-performing segments, or among under-performing customers
in the better segments
– We can use the segment definitions to help us identify possible issues and
question areas to include in the survey

Before we try to apply such a model, we perform a validation
against a holdout sample, to confirm that it is a good model.
Page Number: 83
Potentials of R&D
in
Cooperation with U. of Belgrade
An Overview of Advanced Datamining Projects
for High-Tech Computer Industry
in the USA and EU
VLSI Detection
for
Internet/Telephony Interfaces
Goran Davidović, Miljan Vuletić, Veljko Milutinović,
Tom Chen, and Tom Brunett

* eT
Page Number: 86
[Diagram: USERS, SPECIALIZED INTERNET SERVICE PROVIDER, REMOTE SITE, Superposition/DETECTION]
HOME/OFFICE/FACTORY AUTOMATION ON THE INTERNET
Page Number: 87
Reconfigurable FPGA for EBI
Božidar Radunović, Predrag Knežević, Veljko Milutinović,
Steve Casselman, and John Schewel*

* Virtual
Page Number: 88
[Diagram: USERS, SPECIALIZED INTERNET SERVICE PROVIDER, VCC]
CUSTOMER SATISFACTION vs CUSTOMER PROFILE
Page Number: 89
BioPoP
Veljko Milutinovic, Vladimir Jovicic, Milan Simic,
Bratislav Milic, Milan Savic, Veljko Jovanovic,
Stevo Ilic, Djordje Veljkovic, Stojan Omorac,
Nebojsa Uskokovic, and Fred Darnell
•isItWorking.com
Page Number: 90
Testing the Infrastructure for EBI

• Phones
• Faxes
• Email
• Web links
• Servers
• Routers
• Software
• Statistics
• Correlation
• Innovation
Page Number: 91
CNUCE
Integration and Datamining
on Ad-Hoc Networks and the Internet
Veljko Milutinović,
Luca Simoncini, and Enrico Gregory

*University of Pisa, Santanna, CNUCE
Page Number: 92
GSM
DM
Ad-Hoc
Page Number: 93
Internet
Genetic Search
with Spatial/Temporal Mutations
Jelena Mirković, Dragana Cvetković,
and Veljko Milutinović

*Comshare
Page Number: 94
Drawbacks of INDEX-BASED:
Time to index + ranking
Advantages of LINKS-BASED:
Mission critical applications + customer tuned ranking
Well organized markets: Best first search
If elements of disorder: G w DB mutations
Chaotic markets: G w S/T mutations
Provider
Page Number: 95
e-Banking on the Internet
Miloš Kovačević, Bratislav Milic, Veljko Milutinović,
Marco Gori, and Roberto Giorgi

*University of Siena
Page Number: 96
Bottleneck#1: Searching for Clients and Investments
1472++
*University of Siena + Banco di Monte dei Paschi
Page Number: 97
SSGRR
Organizing Conferences via the Internet
Zoran Horvat, Nataša Kukulj, Vlada Stojanović,
Dušan Dingarac, Marjan Mihanović, Miodrag Stefanović,
Veljko Milutinović, and Frederic Patricelli

*SSGRR, L’Aquila
Page Number: 98
http://www.ssgrr.it
2000: Arno Penzias
2001: Bob Richardson
2002: Jerry Friedman
2003: Harry Kroto
Page Number: 99
Summary
Books with Nobel Laureates:
Kenneth Wilson, Ohio (North-Holland)
Leon Cooper, Brown (Prentice-Hall)
Robert Richardson, Cornell (Kluwer-Academics)
Herb Simon (Kluwer-Academics)
Jerome Friedman, MIT (IOS Press)
Harold Kroto (IOS Press)
Arno Penzias (IOS Press)
Page Number: 100
http://galeb.etf.bg.ac.yu/~vm/
e-mail: [email protected]
Page Number: 101