Data Mining Analytics for
Business Intelligence and
Decision Support
Chid Apte, PhD
Manager, Data Abstraction Research Group
IBM TJ Watson Research Center
http://www.research.ibm.com/dar
Overview
• Knowledge discovery and data mining (KDD)
techniques are used for analyzing and
discovering actionable insights from data.
• The talk will
– Provide technical descriptions of the core algorithms
that comprise data mining analytics
– Describe some business application scenarios for
KDD
– Discuss issues in business intelligence systems
– Map trends in this area
Background
– Widespread and explosive growth in use and size of
databases
• Traditional use: query based report generation
– Size and volumes raise new issues:
• Will data help the business achieve an advantage?
• Can data be used to model underlying processes and predict
their behavior?
• Can we understand the data?
– Providing capabilities to support exploration,
summarization, and modeling of large databases is
the goal of Business Intelligence systems
From Transactions to Warehouses
• Transactional databases:
– Reliable and accurate data capture; logging, book-keeping
• Data warehousing:
– Turning transactional data into a history repository
• Can be queried for summaries and aggregate reports
– First step in transforming transactional data (primary purpose:
reliable storage) into a repository whose primary use is business
intelligence
– May require integration of multiple sources of data
• Dealing with multiple formats; multiple database systems;
integrating distributed databases; data cleaning; creating
unified logical view of underlying non-homogeneous data
On-Line Analytical Processing
(OLAP)
• Supports query driven exploration of the data
warehouse
– Utilizing pre-computed aggregates along data
dimensions
– Deciding which aggregates to pre-compute and how
to derive or reliably estimate the others from pre-computed
projections
– Extends the Structured Query Language (SQL)
framework to accommodate queries that would
otherwise have been computationally impossible on a
relational database management system
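The idea of pre-computed aggregates along data dimensions can be sketched as follows. This is a minimal illustration, not a description of any particular OLAP product: the function names (`precompute_aggregates`, `query`) and the sales records are hypothetical, and every combination of dimensions is materialized, which real systems avoid by choosing which aggregates to store.

```python
from collections import defaultdict
from itertools import combinations


def precompute_aggregates(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (a tiny data cube)."""
    dims_sorted = tuple(sorted(dimensions))
    cube = defaultdict(float)
    for row in rows:
        for size in range(len(dims_sorted) + 1):
            for dims in combinations(dims_sorted, size):
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)


def query(cube, **filters):
    """Answer an aggregate query from the cube without rescanning the rows."""
    dims = tuple(sorted(filters))
    return cube.get((dims, tuple(filters[d] for d in dims)), 0.0)


# Hypothetical sales records, invented for illustration only.
SALES = [
    {"region": "east", "product": "tv", "amount": 100.0},
    {"region": "east", "product": "radio", "amount": 40.0},
    {"region": "west", "product": "tv", "amount": 70.0},
]
CUBE = precompute_aggregates(SALES, ("region", "product"), "amount")
```

A call such as `query(CUBE, region="east")` is then a dictionary lookup rather than a scan of the transactional rows.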
Beyond OLAP
• Supporting queries at a much more abstract level than
SQL and OLAP
– Computer-driven exploration of data as opposed to human
analyst-driven
– Facilitating data exploration of high dimensional data
• Providing solutions when the user cannot describe the goal in
terms of a specific query
– e.g. discovering fraudulent cases in credit card or telephone
usage
• Visualizing and understanding massive volumes of high-dimensional data
• Rates of growth of data sets exceed by far any rates with
which traditional human analyst techniques can cope
A Definition for Data Mining
• Automated search procedures for discovering credible
and actionable insights from large volumes of high
dimensional data
– emphasis upon
• symbolic learning and modeling methods (i.e. techniques that
produce interpretable results)
• data management methods
– use of techniques from statistics, pattern recognition, and
machine learning
• machine learning and statistical modeling also heavily used
in vision, speech recognition, image processing, handwriting
recognition, natural language understanding, etc.
– issues of scalability and automated business intelligence
solutions drive much of data mining and differentiate it from the
other applications of machine learning and statistical modeling
Machine Learning and Statistical
Modeling serve as an important core
[Figure: Machine Learning and Statistical Modeling at the core, surrounded by the labels below]
• Temporal Modeling
• Complex Pattern Detection
• Systems Performance Management
• Feature Creation and Analysis
• Markov Modeling
• Data Management
• Business Decision Support Systems
• Knowledge Discovery and Data Mining
• Speech Understanding
• Computational Linguistics
• Statistical Text Processing
• Handwriting Recognition
• Vision / Image
• Knowledge Management
• NLP/NLU
• Others (Agents, Education, etc.)
Typical Business Intelligence Applications
• Risk Analysis
– Given a set of current customers and their finance/insurance
history data, build a predictive model that can be used to classify
a new customer into a risk category
• Targeted Marketing
– Given a set of current customers and history on their purchases
and their responses to promotions, target new promotions to
those most likely to respond
• Customer Retention
– Given a set of past customers and their behavior prior to leaving,
predict who is most likely to leave and take proactive action
• Fraud Detection
– Detect fraudulent activities either proactively or on-line in real time
• Many other new applications keep surfacing
There’s More to it Than Just Data Mining
• The process of identifying valid, novel, potentially useful,
and understandable patterns in data requires one or
more of:
– Selecting or sampling data from a data warehouse
– Cleaning or pre-processing it
– Transforming or reducing it
– Applying a data mining component to extract models or patterns
– Evaluating the derived structure
• The process is also known as KDD (Knowledge
Discovery and Data Mining)
– Data mining is a key component concerned with the algorithmic
means by which structures are extracted from data while
meeting computational efficiency constraints
The KDD Process
[Figure: process diagram. Labels: Identify Business Opportunity, Select, Transform, Mine, Assimilate, Data Warehouse, Selected Data, Visualization]
data mining n. The process of extracting valid, previously
unknown, and ultimately comprehensible and actionable
information from large databases and using it to make
crucial business decisions.
Data Mining Techniques
• Predictive Modeling
– Predict a specific attribute (database field) based upon the other
attributes (fields) in the data
• Clustering (Segmentation)
– Group data records into subsets where items in subsets are more
similar to each other than to items in other subsets
• Frequent Patterns
– Find interesting similarities between a few attributes in subsets of the
data
• Change & Deviation
– Detect and account for interesting sequence of information in data
records
• Dependencies
– Generate the joint probability density function that might have generated
the data
Predictive Modeling
• Estimate a function ƒ that maps points from an
input space X to an output space Y, given only a
finite sampling of the mapping
– Predict value of field (Y) in a database based on the
other fields (X)
• Accurately construct an estimator ƒ’ of ƒ from a finite sample
known as the training set
– May be corrupted (i.e. noisy)
• If predicted quantity is numeric (i.e. Y = R, the real line) then
the prediction problem is that of regression modeling
• If the predicted quantity is discrete (i.e. Y is a finite set of class labels) then the
prediction problem is that of classification modeling
Issues in Predictive Modeling
• Transformations on input space X to improve
estimation capability
– Feature extraction / construction / selection
• Evaluating the estimate ƒ’ in terms of how well it
performs on data not present in the training set
– Maximizing prediction accuracy by avoiding under-fitting or over-fitting
• Trading off model complexity versus model
accuracy
– Bias-variance tradeoff, penalized likelihood, minimum
message length (MML) or minimum description length
(MDL)
Classification
• Predicting the most likely state of a categorical
variable (the class) given the values of other
variables
– Density estimation problem: deriving the value of Y
given x from the joint density on Y and X
• Kernel density estimators
• Metric-space based methods (k-nearest neighbor)
• Projection into decision regions – divide attribute space into
decision regions and associate prediction with each region
– Linear classifiers, neural networks, decision trees, disjunctive
normal form (DNF) rule-based classifiers
– Projection methods by far the most practical for data
mining
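As an illustration of the metric-space methods listed above, here is a minimal k-nearest-neighbor classifier. It is a sketch under stated assumptions: Euclidean distance, majority voting, a brute-force scan of the training set, and made-up training points and labels. The function name `knn_classify` is hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, query_point, k=3):
    """Predict the majority class among the k training points nearest to query_point.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-attribute customers labeled with a risk category.
TRAIN = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high"),
]
```

Every prediction scans the whole training set, which is one reason the slide singles out projection methods as more practical for large databases.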
Regression
• Predicting the most likely value of a numerical variable
(the target column) given the values of other variables
– Numerical function approximation problem: deriving the value of
Y given x from the joint probability distribution on Y and X
• Statistical probability models (e.g. linear regression)
• Projection into decision regions – divide attribute space into
decision regions and associate constant value with each
region
– Neural networks, decision trees, disjunctive normal form
(DNF) rule-based models
• Hybrid
– Coupling projection methods with statistical models
– Projection and hybrid methods by far the most practical for data
mining
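The simplest statistical model mentioned above, linear regression, can be sketched for a single input variable with the ordinary least-squares closed form. The function name `fit_line` and the sample data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noise-free training set generated from y = 2x + 1.
SLOPE, INTERCEPT = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

A hybrid method in the sense of the slide would fit such a line separately inside each decision region instead of using a constant value per region.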
The Predictive Modeling Process
• Mine historical data to train patterns/models that can predict future
behaviors
• Score with models to reflect likelihood to exhibit the modeled behavior
• Act to optimize business objectives based on these scores
• Behaviors modeled from the data include
– Response to Direct Mail
– Product Quality (Defects)
– Declining Activity
– Credit Risk
– Delinquency
– Likelihood to buy specific products
– Profitability
– etc.
Decision Trees for Predictive Modeling
• Tree generation algorithm
– Beginning with the training set at the root
node, recursively split until a stopping criterion
is met
• Split using “best” test among all possible tests on
all attributes
• Prune tree (MDL, cross-validation, etc.)
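The tree generation loop above can be sketched as a short recursive procedure. This is a simplified illustration, not the author's algorithm: it assumes numerical attributes only, inequality tests on observed cut-points, Gini impurity as the "best" test criterion, a depth limit as the stopping criterion, and no pruning step. All names and the sample rows are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Greedy search over all attributes and cut-points; returns (score, attr, cut) or None."""
    best = None
    for attr in range(len(rows[0])):
        for cut in sorted({row[attr] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[attr] < cut]
            right = [lab for row, lab in zip(rows, labels) if row[attr] >= cut]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, attr, cut)
    return best


def build_tree(rows, labels, max_depth=3):
    """Recursively split until the node is pure, unsplittable, or max_depth is reached."""
    split = best_split(rows, labels) if max_depth > 0 and gini(labels) > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, attr, cut = split
    left = [(r, lab) for r, lab in zip(rows, labels) if r[attr] < cut]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[attr] >= cut]
    return (attr, cut,
            build_tree([r for r, _ in left], [lab for _, lab in left], max_depth - 1),
            build_tree([r for r, _ in right], [lab for _, lab in right], max_depth - 1))


def predict(tree, row):
    """Follow the inequality tests from the root down to a leaf label."""
    while isinstance(tree, tuple):
        attr, cut, left, right = tree
        tree = left if row[attr] < cut else right
    return tree


# Hypothetical rows: (historical sales, risk score), with string class labels.
ROWS = [(2.0, 500), (3.0, 700), (8.0, 600), (9.0, 650)]
LABELS = ["low", "low", "high", "high"]
TREE = build_tree(ROWS, LABELS)
```

Here a node is stored as a tuple `(attribute index, cut-point, left subtree, right subtree)` and a leaf as a plain label, so labels must not themselves be tuples.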
Issues in Decision Tree Building
• Splitting at nodes
– greedy search: GINI or entropy (impurity minimizing), class probability
profile difference, log-loss likelihood, etc.
– exhaustive search: ReliefF, Contextual Merit, etc.
• Testing of attributes
– Numerical attributes: inequality tests on cut-points
– Categorical attributes: subset tests
• Leaf models
– Piecewise constant
– Linear
– Probability functions
• Scalability
A Typical Decision Tree
Expected Sales Revenue
• Historical Sales per Mlg less than 7
  – Hist Avg Sales per Order less than 113: Segment 1
  – Hist Avg Sales per Order greater or equal to 113
    • Climate Indicator 0: Segment 2
    • Climate Indicator 1
      – Risk Score less than 687: Segment 3
      – Risk Score greater or equal to 687
        • Outdoor Purchases less than 3: Segment 4
        • Outdoor Purchases greater or equal to 3: Segment 5
• Historical Sales per Mlg 7-15
  – Credit Limit less than 2200: Segment 6
  – Credit Limit greater or equal to 2200: Segment 7
• Historical Sales per Mlg greater than 15: Segment 8
Training a Predictive Model
[Figure: observations feed the predictive model, which produces predicted outcomes; comparing predicted outcomes with actual outcomes gives the prediction errors]
Training Generalization
[Figure: model fits compared on training and validation data]
• Too many segments: over fit
• Too few segments: under fit
• In between: about right
Optimal Training
[Figure: prediction error plotted against number of rules (degrees of freedom), with training error and validation error curves and the optimum rule set marked]
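The over-fit and under-fit cases can be reproduced with a small experiment. The sketch below is an assumption-laden illustration rather than the method behind the slide: the model is piecewise constant over equal-width segments of [0, 1), the data are a synthetic step function with alternating offsets, and the number of segments is chosen by the lowest validation error. All names are hypothetical.

```python
def segment_index(x, n_segments):
    """Index of the equal-width segment of [0, 1) that contains x."""
    return min(int(x * n_segments), n_segments - 1)


def fit_segments(xs, ys, n_segments):
    """Piecewise-constant model: the mean of y within each segment."""
    sums = [0.0] * n_segments
    counts = [0] * n_segments
    for x, y in zip(xs, ys):
        s = segment_index(x, n_segments)
        sums[s] += y
        counts[s] += 1
    overall = sum(ys) / len(ys)  # fallback for empty segments
    return [sums[s] / counts[s] if counts[s] else overall for s in range(n_segments)]


def mean_squared_error(model, xs, ys):
    n = len(model)
    return sum((y - model[segment_index(x, n)]) ** 2 for x, y in zip(xs, ys)) / len(xs)


def choose_segments(train, valid, candidates):
    """Return the candidate with the lowest validation error, plus all errors."""
    errors = {}
    for n in candidates:
        model = fit_segments(train[0], train[1], n)
        errors[n] = (mean_squared_error(model, *train), mean_squared_error(model, *valid))
    best = min(candidates, key=lambda n: errors[n][1])
    return best, errors


# Synthetic data: a step at x = 0.5 with offsets of +/-0.1 that flip sign
# between the training and validation samples.
XS = [(2 * i + 1) / 32 for i in range(16)]
STEP = [0.0 if x < 0.5 else 1.0 for x in XS]
TRAIN = (XS, [s + (0.1 if i % 2 == 0 else -0.1) for i, s in enumerate(STEP)])
VALID = (XS, [s - (0.1 if i % 2 == 0 else -0.1) for i, s in enumerate(STEP)])
BEST, ERRORS = choose_segments(TRAIN, VALID, [1, 2, 16])
```

With 16 segments the training error drops to zero while the validation error rises, and with one segment both errors are large; the two-segment model is selected.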
Clustering
• Given a finite sampling of points, group
them into sets of similar points
– Representing clusters of points with common
characteristics
• In predictive modeling, class (or value)
membership is known in the training data
– In clustering, this membership is not known a priori, and is instead being discovered by
clustering or segmentation
Techniques for
Clustering/Segmentation
• Two-stage approach
– outer loop to determine cluster number k
– inner loop to fit points to clusters
• Metric distance-based methods; find best k-way partition so
that points in a partition are closer to each other than to
points in other partitions
• Model based methods: a best fit (very typically probabilistic)
model is hypothesized for each cluster
• Partition based methods: iteratively enumerating and scoring
various partition scenarios using heuristic scoring functions
k-Means Clustering Algorithm
• Widely used in data mining
– Given k cluster centers c_{1,j}, c_{2,j}, …, c_{k,j} at iteration j,
compute c_{1,j+1}, c_{2,j+1}, …, c_{k,j+1}
• Cluster assignment: For each i=1,…,m, assign x_i to cluster l(i)
such that c_{l(i),j} is nearest to x_i
• Cluster center update: For l=1,…,k set c_{l,j+1} to be the mean
of all x_i assigned to c_{l,j}
– Stop when c_{l,j} = c_{l,j+1}, l=1,…,k
• Extensions include support for scalability,
efficient placement of initial k means, and
(harder problem) determining the number of
clusters k
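The assignment and update steps above translate almost directly into code. The following is a minimal sketch assuming Euclidean distance, caller-supplied initial centers, and an iteration cap; keeping the old center when a cluster becomes empty is an assumption the slide does not address. Function names and sample points are hypothetical.

```python
import math


def assign(points, centers):
    """Cluster assignment: index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda l: math.dist(p, centers[l]))
            for p in points]


def kmeans(points, centers, max_iter=100):
    """Iterate assignment and center update until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        assignment = assign(points, centers)
        new_centers = []
        for l, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == l]
            if members:
                # Cluster center update: coordinate-wise mean of the members.
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # assumption: keep the center of an empty cluster
        if new_centers == centers:  # stop when c_{l,j} = c_{l,j+1} for all l
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical two-dimensional points with k = 2 initial centers.
POINTS = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
CENTERS, ASSIGNMENT = kmeans(POINTS, [(0.0, 0.0), (10.0, 0.0)])
```

The result depends on the initial centers, which is why the placement of the initial k means appears among the extensions listed above.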
Frequent Patterns
• Extracting compact patterns that describe
subsets of data
– Row-wise patterns
– Column-wise patterns
• Association rules: detecting combinations of
attribute values that occur with a minimum level of
frequency (support) and certainty (confidence)
– Scalable algorithms can find all such rules in linear time
under certain conditions of data sparseness
– Rules are not statements about causal effects amongst
attributes, but can still provide useful insights
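Support and confidence can be made concrete with a brute-force sketch. This is not one of the scalable algorithms mentioned above: it only enumerates single-item rules of the form "a implies b" and rescans the transactions for every candidate. The function names and the baskets are invented for illustration.

```python
from itertools import permutations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    ante = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante if ante else 0.0


def pair_rules(transactions, min_support, min_confidence):
    """All rules a -> b over single items that meet both thresholds."""
    items = sorted({item for t in transactions for item in t})
    found = []
    for a, b in permutations(items, 2):
        s = support(transactions, {a, b})
        c = confidence(transactions, {a}, {b})
        if s >= min_support and c >= min_confidence:
            found.append((a, b, s, c))
    return found


# Hypothetical market baskets.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
RULES = pair_rules(BASKETS, min_support=0.5, min_confidence=0.9)
```

In this data the rule from butter to bread holds with full confidence while the reverse rule does not, which illustrates that a rule describes co-occurrence in one direction and says nothing about causation.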
Change and Deviation
• Detecting sequence information, temporal
or otherwise
– Ordering information of transactions (rows) is
utilized
– Under certain conditions of data sparseness,
sequences with desired levels of frequency
and certainty can be computed in linear time
Dependency Modeling
• Detecting causal structure within data
– Causal models
• Discovering probabilistic distributions governing
the data
• Discovering functional dependencies between
attributes in the data
– Techniques
• Density estimation methods
– Expectation maximization
• Explicit causal modeling
– Bayesian networks
Applying Data Mining in Business
[Figure: business questions arranged around three goals: Profit, Customer Satisfaction, Efficiency]
• Who are the best customers to sell my products to?
• What are the most effective market segments for my business?
• How do I increase market share of my products?
• How do I reduce my costs and not impact production?
• How do I optimize my inventory?
Mapping Operations into Applications
• Predictive Modeling
– Assigning risk levels to new insurance and financial
contracts
• Clustering / Segmentation
– Identifying distinct market groups in customer
population
• Frequent Patterns
– Market basket analysis (what gets shopped together
in a supermarket)
• Change and Deviation
– Fraud discovery in health claim data
– Discovering shopping patterns over time
Business Application Opportunities
Retail/Distribution
Category management
Merchandise planning
Product management
Production planning/tracking
Insurance/Healthcare
Claims analysis
Provider analysis
Managed care
Outcomes analysis
Telecommunications
Utilities
Industrial customer profiles
Financial analysis
"Bulk Power" analysis
Customer profiles/segmentation
Product profitability
Demand forecasting
Usage analysis
Budgeting
Financial reporting
Demographics
Product costing
Manufacturing quality and efficiency
Parts analysis
Transportation
Yield management
Pricing/rate analysis
Logistics
Financial
Cross Industry
Government
Manufacturing
Market basket analysis
Target marketing
Customer segmentation
Customer service
Fraud and abuse
Financial performance
Customer profitability and segmentation
Products/portfolio profitability
Risk management
Cross-sale analysis
Branch performance
The Data Challenge
Where Is It?
What Does it Mean?
How Can I Get It?
What Format Is It In?
No Single View of Data
Multiple Data Sources
Many Interfaces
Difficult to Access
Different Formats, Platforms
Inconsistencies, Redundancies
An Efficient Environment for Data Mining
[Figure: data warehouse environment. Diagram labels follow.]
Enterprise Data Model
Enterprise Warehouse
Operational Data
Global
Specialized Analysis
Datamarts
Datamart
Data Mining Analysis
External Data
Transaction Data
Transformation (legacy system context removal)
Meta-data Definition (e.g. consistent business terms)
Analytical dimension definition (e.g. time, policyholder)
Summarization
Aggregation
Identification / Collection
Review
Conversion / Reduction / Normalization
Representation
The Business Intelligence Process
[Figure: business intelligence process. Diagram labels follow.]
Access
Transform
Operational & External Data
Enhancing
Summarizing
Aggregating
Distribute
Staging
Joining From Multiple Sources
Populating On-Demand
Store
Relational Data
Multiple Platforms & Hardware
Find & Understand
Discover, Analyze, Visualize
Information Catalog
Query
Business Views
Data Interpretation
Models
Multi-Dimensional Analysis
Automate & Manage
Data Mining
Data Flow & Process Flow
Open Interfaces
Multi-Vendor Support
Data Mining Marketplace Status
• Enabler for business intelligence systems
– Data mining algorithm suites
• Loosely coupled with database technology
– Emphasis on data warehousing followed by
exploratory data mining
– Typically conducted by consultants or in-house
analytic teams
• Issues
– Data warehouse requirements
– Sophisticated analytics requirements
Key Challenges and Trends
• Infrastructure
– Enabling transparent and pervasive usage
• Algorithms
– Optimized and robust mining
• Solutions
– Vertically integrated for critical problems
– Enhanced emphasis on the Internet
Infrastructure
• Making data mining transparent
– Database extenders
• e.g. DB2/UDB User Defined Functions for model
training/scoring
• Sufficient statistics (e.g. histograms, counts, samples, etc.)
– Parallel and distributed data mining
• Scalability (sampling and parallelization)
– XML based APIs for database coupling and application
embedding
• Interoperability
• Training and scoring in different environments
– Intelligent or semi-automated data warehousing for mining
• Industry specific templates
• Meta-data mining
Algorithms
• Robust and Automated
– Evaluation metrics
– Automated feature extraction / transformation /
selection
– Discovering relational and hierarchical structures
amongst attributes
– Incorporating prior knowledge to account for costs /
benefits / uncertainty / missing values
– Incremental and on-line mining
– Privacy preserving data mining
– Heterogeneous data mining
Solutions
• Business
– Risk management
– Targeted marketing
– Portfolio management
• Systems
– Performance management
• Internet
– Site profiling and performance tuning
– User personalization
Summary
• Data mining is being embedded in vertical
solutions for business intelligence and
decision support
– Data Management
– eCommerce
– Critical Large-Scale Solutions (CRM, etc.)
• Using data mining should eventually
become as easy and pervasive as working
with databases and spreadsheets today
References
• Mathematical Programming for Data Mining:
Formulations and Challenges, by Bradley et al.,
INFORMS Journal on Computing, Volume 11,
No. 3, Summer 1999
• KDD Nuggets
– http://www.kdnuggets.com
• IBM
– http://www.research.ibm.com/compsci/kdd
– http://www.research.ibm.com/dar
– http://www.ibm.com/bi