Data Mining Overview
James E. Parry
Solution Architect
IBM
Business Analytics software
© 2010 IBM Corporation
Introduction to SPSS, an IBM Company
Leadership in Predictive Analytics
– Market leader – 40+ year heritage in predictive
analytic technologies
– Broad product range
• Statistics, data-mining, data collection and
deployment product families
A “first-to-market” deployment methodology
– A methodology for deploying Predictive Analytics across
the enterprise
– Provides an incremental, phased approach to the
enterprise solution
– Based on the convergence of analytics, architecture and
business processes
“Play well with others”
– Non-intrusive integration (Service Oriented Architecture)
– Database-agnostic
– Leverages existing operational software & IT investments
Agenda
What is Predictive Analytics?
Questions Data Mining Can Answer
Statistics vs. Data Mining
Analysis Tools in the Data Mine
– User Driven vs. Data Driven Tools
Supervised vs. Unsupervised Learning
– Supervised: Prediction and Classification
– Unsupervised: Clustering, Association and Anomaly Detection
Text Mining
Deployment Technology: Making Findings Matter
Q&A
© 2009 SPSS Inc.
Predictive Analytics
Predictive analytics helps connect data to effective action by drawing reliable
conclusions about current conditions and future events.
— Gareth Herschel, Research Director, Gartner Group
Questions Data Mining Can Answer
Commercial Sector
– Reducing campaign costs and increasing customer conversions
– Decreasing customer churn
– Reducing fraud and improper payments
– Maximizing ROI on direct marketing campaigns
– Improving product offerings by understanding customer needs
Public Sector
– Reducing recruiting costs and increasing employee retention
– Decreasing institutional attrition
– Reducing fraud and improper payments
– Maximizing ROI on public service campaigns
– Improving public health and safety by understanding constituent needs
What kinds of questions can you answer with Data Mining in
Public Sector?
It’s all about propensity . . .
– Propensity to . . .
• Be a successful employee
• Network Attack
• Commit Fraud
• Quit
In other words…
What are the
?
How do we figure those propensities out again?
You need predictors…
You need outcomes…
But you don’t necessarily need to understand complex equations to get answers!
Aren’t those statistics?
Traditional Statistical Data Analysis
–Descriptive (sample)
–Inferential (population)
Data Mining (and machine learning in
general)
–Accuracy of prediction (predicted classification)
–Individual predictions
–Rules of thumb
Regardless of whether or not the models are easily
explained!
CRISP-DM, the Cross Industry Standard Process for Data Mining
Process Phases
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
Data Mining vs. Statistical Analysis
Statistical Analysis (User Driven)
–Confirm Hypotheses
–More Data Requirements
–More Assumptions
–General Population Predictions
–Cumulative Results
Data Mining (Data Driven)
–Generate Hypotheses
–More Exploratory
–Less Data Prep
–Fewer Assumptions
–Individual Predictions
–Results Oriented
In a nutshell…
Data mining works by…
– Clearly defining business goals
– Data exploration and hypothesis generation
– Training
– Refining and . . .
– Validating models
– Deploying production models into operational framework
Statistics are most useful when…
– You plan an experiment
– You need to plan data collection wisely
• When data collection is costly: find the minimum number of cases necessary to detect an effect (Power!)
– You need to estimate population parameters
– You need to confirm or fail to confirm a hypothesis
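The “Power!” remark can be made concrete with a sample-size calculation. The sketch below uses the common normal-approximation formula for comparing two group means, n per group = 2 × ((z for alpha + z for power) / d)², where d is the standardized effect size. The function name and the defaults (alpha = 0.05, power = 0.80) are illustrative assumptions, not taken from the slides.

```python
from math import ceil
from statistics import NormalDist


def cases_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate cases needed per group to detect a standardized mean
    difference (Cohen's d) with a two-sided test, normal approximation."""
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for the test
    z_power = z.inv_cdf(power)            # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


medium_effect_n = cases_per_group(0.5)   # a "medium" effect
large_effect_n = cases_per_group(0.8)    # a "large" effect needs fewer cases
print(medium_effect_n, large_effect_n)
```

Smaller effects require many more cases, which is why planning matters most when each case is expensive to collect. An exact t-test calculation would give slightly larger numbers than this approximation.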
Analysis tools in the Data Mine
Query, SQL, Spreadsheets
On Line Analytical Processing
(OLAP)
Data visualisation
Statistics
Rule induction and Segmentation
Neural networks & Decision Trees
Statistics – Descriptive Analysis
Analytic software:
– Data displays (e.g., frequency distributions)
– Graphic displays of data (e.g., histogram)
– Measures of central tendency (e.g., mean, median)
– Estimates of variance (e.g., standard deviation)
(Figure: histogram of “Satisfaction with service 1-10”, frequency by score; Mean = 8.3, Std. Dev = 1.65, N = 248)
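The descriptive measures listed on this slide can be computed with a few lines of Python's standard library. This is a minimal sketch; the scores below are hypothetical and are not the data behind the slide's histogram.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical satisfaction scores (1-10); not the survey data from the slide.
scores = [8, 9, 10, 7, 8, 9, 6, 10, 8, 9]

frequencies = Counter(scores)     # frequency distribution
center_mean = mean(scores)        # central tendency: mean
center_median = median(scores)    # central tendency: median
spread = stdev(scores)            # sample standard deviation

for value in sorted(frequencies):
    # Text histogram: one '#' per response with this score
    print(f"{value:>2} {'#' * frequencies[value]}")
print(f"Mean = {center_mean:.1f}, Median = {center_median}, "
      f"Std. Dev = {spread:.2f}, N = {len(scores)}")
```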
Statistics – Inferential Analysis
Predicting numerical or categorical outcomes
– Linear regression
– GLM Multivariate/Repeated Measures
– Non-linear regression
– Time Series
– Survival Analysis/Cox regression
– Structural Equation Modeling
Statistics – Inferential Analysis
Used often in experimental design, clinical trials and survey research with complex sampling designs
– N.O.R.C. and Gallup use inferential statistics extensively to accurately represent survey data on how people think and feel about the world today.
– NIH uses inferential statistics to analyze experimental data and quantify significant differences in treatments and interventions.
– CDC: extensive epidemiological studies require inferential statistics.
Used to create data when you don’t have it.
– Sample size
– Effect size
– Validity of results
Data Mining
Three classes of data mining algorithms
– Supervised vs. unsupervised
– Complementary
Cluster (“Differences”)
– Group cases that exhibit similar characteristics.
Associate (“Patterns”)
– What events occur together?
– Given a series of actions, what action is likely to occur next?
Predict (“Relationships”)
– Predict who is likely to exhibit specific behavior in the future.
© 2007 SPSS Inc.
What is Supervised Learning?
A technique used when we know the output or outputs
We “supervise” the algorithm by telling it what we want to predict.
Supervised Learning: Profile and Predict
Build a predictive profile of the historical outcome using a collection of potential input fields.
Explores all combinations, interactions and contingencies.
Use this profile to understand and predict future cases.
Decision tree for Credit ranking (1=default). Each node lists % and (n) per category, then the node's share of all cases and its size:
Root: Bad 52.01% (168), Good 47.99% (155), Total (100.00) 323
Split: Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
– Weekly pay: Bad 86.67% (143), Good 13.33% (22), Total (51.08) 165
  Split: Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
  • Young (< 25); Middle (25-35): Bad 90.51% (143), Good 9.49% (15), Total (48.92) 158
  • Old (> 35): Bad 0.00% (0), Good 100.00% (7), Total (2.17) 7
– Monthly salary: Bad 15.82% (25), Good 84.18% (133), Total (48.92) 158
  Split: Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
  • Young (< 25): Bad 48.98% (24), Good 51.02% (25), Total (15.17) 49
    Split: Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
    · Management; Clerical: Bad 0.00% (0), Good 100.00% (8), Total (2.48) 8
    · Professional: Bad 58.54% (24), Good 41.46% (17), Total (12.69) 41
  • Middle (25-35); Old (> 35): Bad 0.92% (1), Good 99.08% (108), Total (33.75) 109
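Each split in the credit tree is chosen by a chi-square test of the candidate field against the outcome. The slide does not say which chi-square statistic is used; the likelihood-ratio form sketched below appears to come close to the reported root value (179.6665) when applied to the Weekly pay / Monthly salary counts, so treat that choice as an inference rather than a documented fact. The function name is illustrative.

```python
from math import log


def likelihood_ratio_chi_square(table):
    """Likelihood-ratio chi-square (G) for a table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # a zero cell contributes nothing to the sum
            expected = row_totals[i] * col_totals[j] / total
            g += observed * log(observed / expected)
    return 2 * g


# Rows: Weekly pay, Monthly salary; columns: Bad, Good (counts from the tree).
root_split = [[143, 22], [25, 133]]
g_root = likelihood_ratio_chi_square(root_split)
print(round(g_root, 2))
```

A tree-growing procedure of this kind would compute such a statistic for every candidate field at each node and split on the most significant one.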
Profile and Predict
Neural Networks
–A technique for predicting outcomes based on inputs, where the inputs are weighted on hidden layers
–Behave similarly to the neurons in your brain
–Powerful general function estimators
–Require minimal statistical or mathematical knowledge
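To make “inputs are weighted on hidden layers” concrete, here is a minimal sketch of a single forward pass through one hidden layer. The weights are hypothetical (a real network learns them from training data), bias terms are omitted for brevity, and the sigmoid activation is an assumption, not something the slides specify.

```python
from math import exp


def sigmoid(x):
    """Squash any number into the range (0, 1)."""
    return 1 / (1 + exp(-x))


def forward(inputs, hidden_weights, output_weights):
    """One forward pass: inputs -> one hidden layer -> single output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        for weights in hidden_weights   # one weight list per hidden unit
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


# Hypothetical weights for two inputs, two hidden units and one output.
hidden_weights = [[0.5, -1.0], [1.5, 0.25]]
output_weights = [1.0, -2.0]
score = forward([1.0, 0.0], hidden_weights, output_weights)
print(round(score, 3))
```

The output can be read as a propensity between 0 and 1. Training consists of adjusting the weights until such outputs match known outcomes.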
Neural Network Anatomy (figure)
Neural Network Output (figures)
Neural Network Summary
Excellent for modeling complex relationships and predicting outcomes
– Can handle nonlinearity and interactions with ease
Good for solving many different problem sets (categorical, binary, scale predictors and outcomes)
Very poor (Black Box) at describing the relationships among predictors and outcomes
Profile and Predict
Decision Trees and Rule Induction
–Classification systems that predict or classify
–Technique that shows the ‘reasoning’
– contrast with Neural Network
–Builds sets of easy to understand ‘If – Then’ Rules
–Eliminates factors that are unimportant
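As an illustration of ‘If – Then’ rules, the credit-ranking tree from the earlier Profile and Predict slide can be written as plain conditionals. The function name and the lower-case category labels are assumptions made for this sketch; the percentages are the leaf purities shown in that tree.

```python
def credit_rule(pay, age, social_class=None):
    """Walk the credit tree and return (prediction, rule that fired)."""
    if pay == "weekly":
        if age in ("young", "middle"):
            return "Bad", "IF weekly pay AND young/middle THEN Bad (90.51%)"
        return "Good", "IF weekly pay AND old THEN Good (100.00%)"
    # Remaining cases are paid a monthly salary.
    if age in ("middle", "old"):
        return "Good", "IF monthly salary AND middle/old THEN Good (99.08%)"
    if social_class in ("management", "clerical"):
        return "Good", ("IF monthly salary AND young AND "
                        "management/clerical THEN Good (100.00%)")
    return "Bad", "IF monthly salary AND young AND professional THEN Bad (58.54%)"


prediction, rule = credit_rule("monthly", "young", "professional")
print(prediction, "<-", rule)
```

Unlike a neural network, every prediction comes with the rule that produced it, which is the ‘reasoning’ the slide refers to.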
Basic Decision Tree*
(Figure: decision tree with root “weather”.)
– sunny: test Temp > 75 (leaves: BBQ, Eat in)
– rainy: Eat in
– cloudy: test windy (no: BBQ; yes: Eat in)
*www.cs.utsa.edu/~kwek/cs6463s05/Classification.ppt
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
–Tree is constructed in a top-down recursive divide-and-conquer manner
–At start, all the training examples are at the root
–Attributes are categorical (if continuous-valued, they are discretized in advance)
–Examples are partitioned recursively based on selected attributes
–Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)*
*www.cs.utsa.edu/~kwek/cs6463s05/Classification.ppt
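The greedy algorithm above can be sketched in a few lines using information gain as the selection measure. This is a simplified, ID3-style illustration rather than the implementation in any particular product; the training rows are hypothetical and only loosely echo the BBQ example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    """Top-down, recursive, greedy tree induction."""
    if len(set(labels)) == 1:        # pure node: stop
        return labels[0]
    if not attrs:                    # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = build_tree(
            [r for r, _ in pairs], [l for _, l in pairs], remaining)
    return tree


# Hypothetical training examples.
rows = [
    {"weather": "rainy", "windy": "no"}, {"weather": "rainy", "windy": "yes"},
    {"weather": "cloudy", "windy": "no"}, {"weather": "cloudy", "windy": "yes"},
    {"weather": "cloudy", "windy": "no"}, {"weather": "rainy", "windy": "no"},
]
labels = ["Eat in", "Eat in", "BBQ", "Eat in", "BBQ", "Eat in"]
tree = build_tree(rows, labels, ["weather", "windy"])
print(tree)
```

On this data “weather” yields the larger information gain, so it becomes the root, and “windy” is only tested inside the cloudy branch.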
Decision Tree Anatomy (figures; axes X1 and X2)
Running a Decision Tree (figures)
Decision Tree Output (figure; Start Here: Claim Amount)
Why not just use Regression?
• OccPrest = 4(Educ) + 10: “For every year of education completed, occupational prestige increases by 4 points on average.”
How do we describe the relationship? (figure; axes X1 and X2)
– Does an increase in X2 lead to Green?
– Can a line describe something that is not linear by nature?
– Many phenomena cannot be fit to a straight line.
Decision Trees
Excellent at uncovering and modeling complex relationships
Very accurate on even small data sets to inform decision making.
Can handle nonlinear relationships with complex interactions.
Very easy to understand and describe to others.
Time to insight in minutes.
What is Unsupervised Learning?
A data mining technique used when we do not know the output or outputs
Can be thought of as finding ‘useful’ patterns above and beyond noise… or “fishing” for information
Unsupervised Learning:
Clustering and Association
Find emerging patterns and unusual cases.
Use data mining to examine the differences and shifts across all dimensions of the data.
Select large groups to identify common patterns.
Select small groups to identify unusual patterns.
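The idea of large groups showing common patterns and small groups showing unusual ones can be illustrated with a tiny k-means clustering. This is a minimal sketch: the points and starting centroids are hypothetical, and k-means is only one of several clustering methods that could be used here.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]      # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical cases described by two numeric fields.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
sizes = [len(cluster) for cluster in clusters]
print(centroids, sizes)
```

Here the large cluster describes the common pattern, while the cluster containing a single case is a candidate for closer inspection.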
Cluster and Associate
Clustering
– An exploratory data analysis technique
– Reveals natural groups within a data set
– Uses a distance measure; no prior knowledge about groups or characteristics is needed
– Not always an end in itself
Associations
– Finds things that occur together (e.g., events in a crime incident)
– Associations can exist between any of the attributes
(no single outcome like Decision Trees)
Sequential Associations
– Discovers association rules in time-oriented data
– Find the sequence or order of the events
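Association rules are usually judged by support (how often items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both on a handful of hypothetical incident records; the item names are invented for illustration.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= transaction for transaction in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) from co-occurrence counts."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical incident records, each a set of observed attributes.
incidents = [
    {"night", "burglary", "forced entry"},
    {"night", "burglary"},
    {"day", "theft"},
    {"night", "burglary", "forced entry"},
]
night_burglary_support = support(incidents, {"night", "burglary"})
forced_entry_confidence = confidence(incidents, {"burglary"}, {"forced entry"})
print(night_burglary_support, round(forced_entry_confidence, 2))
```

Because any attribute can appear on either side of a rule, there is no single outcome field, in contrast to a decision tree.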
Anomaly Detection
Anomalies
– Anomaly detection is an exploratory method
– Designed for quick detection of unusual cases or records that should be
candidates for further analysis
– These should be regarded as suspected anomalies, which, on closer
examination, may or may not turn out to be real
Anomaly Detection: Output
Anomalous Records
– Each record is assigned an anomaly index ($O-AnomalyIndex), which is the ratio of the group deviation index to its average over the cluster that the case belongs to.
– The larger the value of this index, the more the case deviates from the average.
– Under usual circumstances, cases with anomaly index values less than 1 or even 1.5 would not be considered anomalies, because the deviation is about the same as, or only a bit more than, the average.
– However, cases with an index value greater than 2 could be good anomaly candidates because the deviation is at least twice the average.
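The ratio described above can be sketched as follows. The per-case deviation values are hypothetical inputs; how a product actually computes its group deviation index is not covered in the slides, so only the final ratio and the cut-off of 2 are taken from the text.

```python
def anomaly_indices(deviations):
    """Each case's deviation divided by the average deviation in its cluster."""
    average = sum(deviations) / len(deviations)
    return [deviation / average for deviation in deviations]


# Hypothetical deviation scores for the five cases of one cluster.
cluster_deviations = [1.0, 1.0, 1.0, 1.0, 6.0]
indices = anomaly_indices(cluster_deviations)

# Flag cases whose deviation is more than twice the cluster average.
suspects = [i for i, index in enumerate(indices) if index > 2]
print(indices, suspects)
```

Flagged cases are only suspects; as the previous slide notes, closer examination decides whether they are real anomalies.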
Text Mining
What is Text Mining?
Most data held within an organization is in the form of
unstructured text documents or records:
–Emails, communications logs,
–Reports,
–Web pages, blogs, …
Text Mining refers to extracting usable knowledge from unstructured text data, through identification of core concepts, opinions and trends, to drive better business decisions across the enterprise.
Text Mining Timeline: Text Extraction
Example: “Mr. Smith aka Mr. Ahmed was seen on the corner of Church St. and Magnolia Ave. on Nov 13th”
70’s: Bag of « Words » extraction
– Mr., Smith, aka, was, seen, with, Ahmed, on, the, corner, of, Church, etc.
80’s: Expressions extraction
– Mr. Smith, was seen, Mr. Ahmed, corner, Church St., Magnolia Ave., Nov 13th
90’s: Named Entities extraction
– Mr. Smith -> Person
– Mr. Ahmed -> Person
– aka -> Alias
– was seen -> location
– Church St. -> Address
– Magnolia Ave. -> Address
– Nov 13th -> Date
Now: Events/Sentiment Extraction, Entity Grouping, Build Categories
– Mr. Smith (Person) -> aka (Alias) -> Mr. Ahmed (Person) was seen (location) -> Church and Magnolia (address) -> November 13 (Date)
– Citizens -> us citizens -> civilians -> civilian bus -> …
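The first and third stages of the timeline can be imitated with regular expressions on the slide's example sentence. The patterns below are hand-written assumptions that happen to fit this one sentence; real text-mining tools rely on linguistic resources, not a few regexes.

```python
import re
from collections import Counter

sentence = ("Mr. Smith aka Mr. Ahmed was seen on the corner of "
            "Church St. and Magnolia Ave. on Nov 13th")


def bag_of_words(text):
    """Lower-cased word counts; word order and context are discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


bag = bag_of_words(sentence)

# Hypothetical patterns standing in for a named-entity extractor.
persons = re.findall(r"Mr\. [A-Z][a-z]+", sentence)
addresses = re.findall(r"[A-Z][a-z]+ (?:St|Ave)\.", sentence)
dates = re.findall(r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
                   r"\d{1,2}(?:st|nd|rd|th)", sentence)

print(bag.most_common(3))
print(persons, addresses, dates)
```

The bag of words only knows that “mr” occurs twice, whereas the entity patterns recover who, where and when, which is the progression the timeline describes.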
Discover critical information with TM
The Deployment Technology
In Data Mining, time to insight is half the battle. Time to production is the other half (and much more repetitive).
Must be able to ‘deploy’ model into operations:
– Quickly
– In a standards-based, repeatable fashion
Must be able to monitor model performance for ‘drift’.
Automating model performance monitoring and model refresh decreases errors because it’s a ‘hands off’ operation: no user intervention required.
Automating model refresh guarantees the most accurate models in the shortest amount of time (time to production).
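Monitoring for ‘drift’ can be as simple as comparing recent accuracy in production with the accuracy measured at deployment. The sketch below is one hedged way to do that; the function name, the 0.05 tolerance and the sample outcomes are all hypothetical.

```python
def needs_refresh(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Flag drift when recent accuracy falls well below the baseline.

    recent_outcomes: (predicted, actual) pairs collected in production.
    Returns (refresh_needed, recent_accuracy).
    """
    correct = sum(predicted == actual for predicted, actual in recent_outcomes)
    recent_accuracy = correct / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance, recent_accuracy


# Hypothetical production results for a fraud model validated at 90% accuracy.
recent = [("fraud", "fraud"), ("ok", "ok"), ("ok", "fraud"), ("ok", "ok")]
refresh, accuracy_now = needs_refresh(0.90, recent)
print(refresh, accuracy_now)
```

Run on a schedule, a check like this removes the manual step of deciding when a model has gone stale.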
Data Mining Considerations
Data
– Available data (structured/unstructured)
– Relevant factors
– Subject matter expertise
Modeling
– Supervised vs. Unsupervised
– Different types of models (NN vs. Rules)
– Combining models (Meta modeling)
Deployment
– Batch vs. Real-time
– Production Automation
• Scheduling
• Champion – Challenger
• Multi-step jobs, conditional logic
– Governance
• Version control
• Security and auditing
Managing (many) Models
• How do you keep models secure and keep track of their evolution?
• Where do your models sleep at night?
• Whose model is in production right now?
• Which version is it?
• How do you manage a model once it’s in production?
• How does it perform now?
• How is it likely to perform on new data?
• When is it time to retire this model?
Collaboration & Deployment
Analytic content management repository
– Version control
– Powerful search
• Analytic awareness
– Security and auditing
Process management
– Multi-step jobs
– Conditional job flow
– Scheduling
– Automated model evaluation
• Champion - challenger
– Open integration
• SPSS tools and non-SPSS tools
Integration & delivery interfaces
– Reporting
– Automatic delivery of analytical output
– Multiple IT infrastructure integration options
• Web services, authentication, and database interfaces
Store, manage, automate, distribute, score ...
Store & manage Modeler artifacts
– Streams, data files, output
– Search on data-mining metadata
Automate Modeler operations
– Stream execution
– Support for remote Modeler servers and clusters
– Scheduling and Automation
Model Management
– Refresh
– Score
– Evaluate (Accuracy, Gains, Accreditation)
Store & distribute output from Modeler
– File based output (cou, html, dat, jpg, etc)
– New graph templates from Viz Designer
Batch and Real-time scoring of models
Deployment: Model Refresh
Deployed models can automatically be refreshed using the Champion-Challenger scenario . . .
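A Champion-Challenger refresh can be sketched as: score both models on the same holdout cases and promote the challenger only if it does better. Everything below (the two rule-style models, the field names and the holdout cases) is hypothetical and stands in for whatever models and evaluation measure an organization actually uses.

```python
def accuracy(model, cases):
    """Share of holdout cases the model classifies correctly."""
    return sum(model(features) == outcome for features, outcome in cases) / len(cases)


def refresh(champion, challenger, holdout):
    """Keep the champion unless the challenger is strictly more accurate."""
    if accuracy(challenger, holdout) > accuracy(champion, holdout):
        return challenger
    return champion


def champion(claim):       # hypothetical model currently in production
    return "REFER" if claim["amount"] > 10000 else "PAY"


def challenger(claim):     # hypothetical retrained candidate
    return "REFER" if claim["amount"] > 5000 or claim["prior_claims"] > 3 else "PAY"


# Hypothetical holdout cases with known outcomes.
holdout = [
    ({"amount": 12000, "prior_claims": 0}, "REFER"),
    ({"amount": 6000, "prior_claims": 1}, "REFER"),
    ({"amount": 2000, "prior_claims": 5}, "REFER"),
    ({"amount": 1000, "prior_claims": 0}, "PAY"),
    ({"amount": 7000, "prior_claims": 0}, "PAY"),
]
in_production = refresh(champion, challenger, holdout)
print(in_production.__name__, accuracy(champion, holdout), accuracy(challenger, holdout))
```

Requiring a strict improvement keeps the production model stable when the two candidates tie.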
Deployment Steps
Production considerations . . .
– Models should be easily deployed and managed
– No SQL programming necessary
– No DBA intervention
– Done in a standards-based, replicable fashion (not one-off)
Predictions and Confidence (figures)
From Analyst to Production in Minutes
Real Time Prediction (figure)
Deployment: Web Application
Deploy into a web application where cases can be scored to find unusual attributes . . .
Deployment
Intelligently route claims and enhance rules engines with output from a deployed data mining model.
Action based on data mining model.
Action = REFER
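Combining a deployed model's score with business rules might look like the sketch below. Only the REFER action comes from the slide; the threshold, the amount rule and the other action names are hypothetical.

```python
def route_claim(fraud_propensity, claim_amount, refer_threshold=0.7):
    """Pick an action from a model score plus a simple business rule."""
    if fraud_propensity >= refer_threshold:
        return "REFER"    # model output drives the referral
    if claim_amount > 50_000:
        return "REVIEW"   # hypothetical rule that applies regardless of the score
    return "PAY"


action = route_claim(fraud_propensity=0.83, claim_amount=4_200)
print("Action =", action)
```

Keeping the threshold as a parameter lets the business tune how many claims are referred without retraining the model.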
Summary of the Science
• Data Mining is a ‘data’ driven process where control is relinquished to the machine and many hypotheses are generated and explored.
• Leveraging these insights leads to high ROI.
• Supervised and Unsupervised learning techniques complement each other.
• Clustering and Anomaly Detection are excellent at typifying data with high dimensionality and finding the ‘needle in the haystack’.
• They also serve as great data reduction techniques for data preparation.
• Decision Trees and Neural Nets are great at predicting and classifying; Decision Trees are generally easier to interpret.
• Deployment technology is key to unleashing predictive analytics.
Questions?