Download Oracle Data Mining Case Study: Xerox

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Oracle 10g DB
Data Warehousing
ETL
<Insert Picture
Here>
OLAP
Statistics
Oracle Data Mining
Case Study: Xerox
Session ID: S283051
Data Mining
Charlie Berger
Sr. Dir. Product Management, Life & Health Sciences Industry & Data Mining Technologies
Oracle Corporation
[email protected]
Tracy E. Thieret
Principal Scientist, Imaging and Systems Technology Center
Xerox Innovation Group
Webster, New York
Copyright © 2006 Oracle Corporation
Agenda
Oracle 10g DB
Data Warehousing
1. Oracle Data Mining Overview
ETL
OLAP
Statistics
Data Mining
2. Xerox Case Study
Copyright © 2006 Oracle Corporation
What is Data Mining?
• Process of sifting through massive amounts
of data to find hidden patterns and discover
new insights
• Data Mining can provide valuable results:
• Identify factors more associated with a target
attribute (Attribute Importance)
• Predict individual behavior (Classification)
• Find profiles of targeted people or items
(Decision Trees)
• Segment a population (Clustering)
• Determine important relationships with the
population (Associations)
• Find fraud or rare “events” (Anomaly Detection)
Copyright © 2006 Oracle Corporation
Data Mining: Find hidden Patterns
• Data Mining can find previously hidden patterns and
relationships to help you:
• Make informed predictions and…
• Better understand customers
• Data Mining can help answer questions such as:
•
•
•
•
•
Which customers are likely to churn or attrite?
Which customers are likely to respond to this offer?
Which employees are likely to leave?
What “next product” should I recommend to this customer?
Which factors are most associated with a target attribute e.g. high value
customers
• Which customer or transactions are most “unnatural” or possibly
suspicious?
Copyright © 2006 Oracle Corporation
Data Mining: Discover New Insights
• Data Mining uncover hidden patterns and relationships to
help you:
• Discover new segments, clusters, and subgroups and …
• Data Mining can help answer questions such as:
• What are the profiles subpopulations or items of interest e.g. churners,
profitable customers, defective product, etc.
• What natural segments or clusters exist in my data?
• Which items are typically purchased together?
• What items seems to fail together?
• Which genes are most associated with this disease?
Copyright © 2006 Oracle Corporation
Oracle Data Mining 10gR2
Oracle in-Database Mining Engine
• Oracle Data Miner (GUI)
• Simplified, guided data mining
• Spreadsheet Add-In for Predictive Analytics
• “1-click data mining” from a spreadsheet
• PL/SQL API & Java (JDM) API
• Develop advanced analytical applications
• Wide range of algorithms
•
•
•
•
•
•
•
•
Anomaly detection
Attribute importance
Association rules
Clustering
Classification & regression
Nonnegative matrix factorization
Structured & unstructured data (text mining)
BLAST (life sciences similarity search algorithm)
Copyright © 2006 Oracle Corporation
10g Statistics & SQL Analytics
FREE (Included in Oracle SE & EE)
• Ranking functions
• Descriptive Statistics
• rank, dense_rank, cume_dist, percent_rank, ntile
• Window Aggregate functions
(moving and cumulative)
• Avg, sum, min, max, count, variance, stddev,
first_value, last_value
• average, standard deviation, variance, min, max, median
(via percentile_count), mode, group-by & roll-up
• DBMS_STAT_FUNCS: summarizes numerical columns
of a table and returns count, min, max, range, mean,
stats_mode, variance, standard deviation, median,
quantile values, +/- n sigma values, top/bottom 5 values
• Correlations
• LAG/LEAD functions
• Direct inter-row reference using offsets
• Reporting Aggregate functions
• Sum, avg, min, max, variance, stddev, count,
ratio_to_report
• Statistical Aggregates
• Correlation, linear regression family, covariance
• Linear regression
• Fitting of an ordinary-least-squares regression line
to a set of number pairs.
• Frequently combined with the COVAR_POP,
COVAR_SAMP, and CORR functions.
Note: Statistics and SQL Analytics are included in Oracle
Database Standard Edition
• Pearson’s correlation coefficients, Spearman's and
Kendall's (both nonparametric).
• Cross Tabs
• Enhanced with % statistics: chi squared, phi coefficient,
Cramer's V, contingency coefficient, Cohen's kappa
• Hypothesis Testing
• Student t-test , F-test, Binomial test, Wilcoxon Signed
Ranks test, Chi-square, Mann Whitney test, KolmogorovSmirnov test, One-way ANOVA
• Distribution Fitting
• Kolmogorov-Smirnov Test, Anderson-Darling Test, ChiSquared Test, Normal, Uniform, Weibull, Exponential
• Pareto Analysis (documented)
• 80:20 rule, cumulative results table
Copyright © 2006 Oracle Corporation
In-Database Analytics
Advantages
Oracle 10g DB
• Data remains in the database at all
times…with appropriate access security
control mechanisms—fewer moving parts
• Straightforward inclusion within interesting
and arbitrarily complex queries
• Real-world scalability—available for mission critical
appls
• Enabling pipelining of results without costly
materialization
• Scalable & Performant
Data Warehousing
ETL
OLAP
Statistics
Data Mining
• Real-time scoring 2.5 million records scored in 6 seconds
on a single CPU system
Copyright © 2006 Oracle Corporation
Oracle 10g DB
Data Warehousing
ETL
<Insert Picture Here>
OLAP
Statistics
Oracle Data Mining 10g
D E M O N S T R A T I O N
Data Mining
Copyright © 2006 Oracle Corporation
Oracle Data Mining
Oracle Data Mining provides
summary statistical information
prior to data mining
Copyright © 2006 Oracle Corporation
Oracle Data Mining
Oracle Data Mining provides
model performance and
evaluation viewers
Oracle Data
Mining’s
Activity
Guides
simplify &
automate
data mining
for business
users
Copyright © 2006 Oracle Corporation
Oracle Data Mining
Apply model
viewers
Additional model
evaluation viewers
Copyright © 2006 Oracle Corporation
Example #1:
Simple, Predictive SQL
• Select customers who are more than 60% likely to
purchase a 6 month CD and display their marital
status
SELECT * from(
SELECT A.CUST_ID, A.MARITAL_STATUS,
PREDICTION_PROBABILITY(CD_BUYERS76485_DT, 1
USING A.*) prob
FROM CBERGER.CD_BUYERS A)
WHERE prob > 0.6;
Copyright © 2006 Oracle Corporation
Oracle Data Mining 10g R2
Decision Trees
• Decision Trees
Problem: Find customers
likely to buy a new car and
their profiles
Income
• Classification
• Prediction
• Customer
“profiling”
>$50K
<=$50K
Gender
M
Age
FF
>35
Status
Married
Buy = 0
Gender
Single F
Buy = 1
Buy = 0
M
<=35
HH Size
>4
Buy = 1 Buy = 0
<=4
Buy = 1
IF (Income >50K AND Gender=F AND Status >Single… ), THEN P(Buy Car=1)
Confidence= .77
Support = 250
Copyright © 2006 Oracle Corporation
Oracle Data Mining 10g R2
Anomaly Detection
Problem: Detect
rare cases
• “One-Class” SVM Models
•
•
•
•
•
Fraud, noncompliance
Outlier detection
Network intrusion detection
Disease outbreaks
Rare events, true novelty
X2
X1
X2
X1
Copyright © 2006 Oracle Corporation
Oracle Data Mining
Algorithm Summary 10gR2
Problem
Algorithm
Applicability
Classification
Decision Tree
Popular / Rules / transparency
Naïve Bayes
Support Vector Machine
Adaptive Bayes Network
Embedded app
Regression
Support Vector Machine
Wide / narrow data
Attribute Importance
Minimum Description
Length (MDL)
Attribute reduction
Identify useful data
Reduce data noise
Association Rules
Apriori
Market basket analysis
Link analysis
Clustering
Hierarchical K-Means
Feature Extraction
Wide / narrow data
Rules / transparency
Hierarchical O-Cluster
Product grouping
Text mining
Gene and protein analysis
NMF
Text analysis
Feature reduction
Copyright © 2006 Oracle Corporation
Integration with Oracle BI EE
Oracle Data
Mining reveals
important
relationships,
patterns,
predictions &
insights to the
business users
Create Categories
of Customers
Copyright © 2006 Oracle Corporation
Spreadsheet Add-In for Predictive Analytics
• Enables Excel
users to “mine”
Oracle or Excel
data using “one
click” Predict and
Explain predictive
analytics features
• Users select a table
or view, or point to
data in Excel, and
select a target
attribute
Copyright © 2006 Oracle Corporation
Oracle 10g DB
Data Warehousing
ETL
<Insert Picture Here>
OLAP
Statistics
Oracle Data Miner 10gR2
Code Generation Release
Data Mining
Copyright © 2006 Oracle Corporation
Oracle Data Miner (gui)
10gR2 Summer OTN Release
• PL/SQL code
generation for
Mining Activities
Copyright © 2006 Oracle Corporation
Oracle Data Miner (gui)
10gR2 Summer OTN Release
Copyright © 2006 Oracle Corporation
Analytics vs.
1. In-Database Analytics Engine
Basic Statistics (Free)
Data Mining
Oracle 10g DB
Text Mining
Data Warehousing
1. External Analytical Engine
Basic Statistics
Data Mining
Text Mining (separate: SAS EM for Text)
Advanced Statistics
ETL
OLAP Statistics
2. Development
Platform
Data Mining
Java (standard)
SQL (standard)
J2EE (standard)
3. Costs (ODM: $20K cpu)
Simplified environment
Single server
Security
2. Development
Platform
SAS Code (proprietary)
3. Costs (SAS EM: $150K/5 users)
Annual Renewal Fee
(~40% each year)
Copyright © 2006 Oracle Corporation
Oracle 10g DB
Data Warehousing
ETL
<Insert Picture Here>
OLAP
Statistics
Partners
Data Mining
Copyright © 2006 Oracle Corporation
SAP Business Warehouse
Connector (ODM-BW Connector)
• Seamless
integration for SAP
customers
• Secure
• Data remains in
database
• Single version of
truth
• Easy to use
Copyright © 2006 Oracle Corporation
SPSS Clementine
• NASDAQ-listed, top 25
software company
• 35+ year heritage in
analytic technologies
• Operations in over 60
countries
• More than 95% of FORTUNE
1000 are SPSS customers
• Combine SPSS Clementine
ease of use with ODM
in-Database functionality
& scalability
• Build, store, browse and score
models in the Database for
optimal performance
• For more information :
•
•
•
SPSS – Roger Lonsberry, (312) 651-3475 or [email protected]
Oracle – Alan Manewitz, (925) 984-9910 or [email protected]
Oracle – Charlie Berger, (781) 744-0324 or [email protected]
Copyright © 2006 Oracle Corporation
InforSense -- A Single Optimized Environment for
Real Time Business Analytics within the Database
Deploy the analytic workflow
as a service embedding to
BPEL, SFA, CRM
Oracle
Decision Tree
Model
Oracle Data
Sources
Interact with (visualize) data
at any step in the workflow
InforSenseService
Deployment
Deploy the analytic workflow
as an Oracle Portal
Oracle
Functionalities:
Data Mining
Preprocess
Statistics
Text
OLAP
Scheduler
SAS free analytics: leverage Oracle analytics
SQL free analytics: drag-drop application build
Visual analytics: interactive visualisation
Integrative analytics: unified analytical environment
Automated analytics: deploy to Oracle Portal and BPEL
Copyright © 2006 Oracle Corporation
Oracle Real-Time Decision Engine
For enabling Operational Business Intelligence
Telco
Retail
Fins
Contact
Contact
Center
Center
IVR
IVR
Web
Web
Oracle Real-Time Decision
(RTD) Engine
Campaign
Campaign
Management
Management
Health
ATM
ATM
Travel
Kiosk
Kiosk
Others
Front
Front
Office
Office
Eligibility Prediction / Learning
Engine Scoring Engine Engine
Business
Business
Intelligence
Intelligence
Oracle
OracleData
Data
Mining
Mining
Data
DataWarehouse
Warehouse
Copyright © 2006 Oracle Corporation
Benefits of Oracle’s Approach
In-Database Analytics
Benefit
• Platform for Analytical
Applications
• Eliminates data movement and
security exposure
• Fastest: DataÆInformation
• Wide range of data mining
algorithms & statistical
functions
• Supports most analytical
problems
• Runs on multiple platforms
• Applications may be developed
and deployed
• Built on Oracle Technology
• Grid, RAC, integrated BI,…
• SQL & PL/SQL available
• Leverage existing skills
Copyright © 2006 Oracle Corporation
“This presentation is for informational purposes only and may not be incorporated into a contract or agreement.”
The role of Data Mining
in Rules-based Remote Services Delivery
Tracy E. Thieret
Principal Scientist
Imaging and Systems Technology Center
Xerox Innovation Group
Webster, New York
Talk Track
• Introduction to Xerox
• Business Metrics and Requirements
• How do we get data from our devices?
• OK, we have data. Now what?
• Before you can do Data Mining…
• The Process and some Results
• The Rewards
Oracle OpenWorld: October 2006
2
Xerox Innovation Group Locations
XRCC
PARC
*
El Segundo*
Oracle OpenWorld: October 2006
** WCR&T/ISTC
* Stamford
XRCE
*
3
Introduction to Xerox
It’s all about Documents
„ Copying
„ Format
and Printing
conversion – electronic to paper and back
How do we make money?
„ Engineering
„ Chemistry
„ Services
design of marking products
and Physics of Materials
around Marking and Scanning
Oracle OpenWorld: October 2006
4
Some Xerox Engineered Products
Full Range: Desktop to Production
Phaser 6250
Nuvera
DocuColor iGen3
WorkCenterPro 90
Oracle OpenWorld: October 2006
5
Document System Device
Modern document devices perform multiple functions
„ Print, Scan, Copy, Fax, E-Mail, OCR, and much more…
„ Simultaneously
Complex Electro-Mechanical devices
„ DocuColor iGen3 Digital Press
‹ 85 Computers, 5M Lines of S/W, 3.5 miles of wires, 192 Sensors, 102 Motors
Process Control
Loops
Steering & Motion
Control Loops
Smart Paper Trays
Print
Engine
Oracle OpenWorld: October 2006
Stacking
6
US Consumables Industry
$~37 Billion Annually
Photoreceptors
220,000/day
Toners/Carriers
25,000 Freight Cars/Year
Fuser Components
35 Million Rolls/Year
Paper & Transparencies
4 Billion Sheets/Day
Copying
Copying
Printing
Printing
Faxing
Faxing
Specialty Materials
•Fuser Oil
•Cleaner Blades
Oracle OpenWorld: October 2006
Inks
490,000 Cartridges/Day
7
Toner
A Highly Complex and Constrained Material
20 µm
Oracle OpenWorld: October 2006
8
Business Objectives:
Reducing Costs in each LoB
Engineering Design
Providing Increased Functionality within Boundaries
„ Total Manufacturing Costs
„ Software Development
Toner Chemistry and Physics
„ New Designs
„ Improved Functionality
Services Delivery
„ Xerox’s Internal Service Force
„ Parts and Labor
„ Accelerate Collective Learning
Oracle OpenWorld: October 2006
Convergence to Mature Metrics
Measure of “Goodness”
„
Product
Launch
$
Time…
Product
EoL
9
Data from Devices
Make use of external and
internal standards to speed
development and deployment
of capabilities.
Network/Systems Mgmt App
On-site Solutions
Xerox & Partner Sites
Customer Site
n
io
t
a
m w
r
fo
In Flo
Inf
or
m
Flo atio
w n
Web Presence
and Back-Office
Devices
Information
Flow
Active Device Agent
with embedded
intelligence
Oracle OpenWorld: October 2006
Enhanced web
access to tools and
services
10
OK – we have the data. Now what?
Deliver Data-Centric Services to Customers
„ AMR:
Automated Meter Reading
„ ASR: Automated Supplies Replenishment
„ Others in the Pipe
Feed-forward to Service Reps for Repair Hints
Knowledge Development in Engineering
Oracle OpenWorld: October 2006
11
Focus on Break-Fix Service
A host of Questions before you start…
„ How to deploy Knowledge to Field Personnel?
„ Knowledge Representation?
„ Transparency?
„ Ease of Knowledge Development?
„ Decoupling of Cycle Times?
Machine Software Releases
z Knowledge Discovery
z
Rules
„ Out
of Favor in the ’80s
„ Back
in with deployment of Business Rules
Oracle OpenWorld: October 2006
12
Where do the Rules come from?
From the knowledge of the experts
„ Interviews
with Engineering and Service Reps.
„ Computational
„ “Same
Capture and Analysis
problem, Different Machine”
Fast cycle time discovery of hidden knowledge
„ Data
Mining
‹
Many algorithms that deliver rule ready results
‹
Triage the rules with the SMEs before deployment
‹
Test for effectiveness in the field
Oracle OpenWorld: October 2006
13
Competitive Benchmarking
Our Choice
Oracle OpenWorld: October 2006
14
Detecting the Unexpected
Oracle OpenWorld: October 2006
15
Before you can do Data Mining…
Domain
Analysis
Repeat
as
necessary
• Business Hypotheses – What problem are you trying to address?
• Cost/benefit modeling
• Domain Knowledge Acquisition
Target
Target
Data
Data Set(s)
Set(s)
• Assemble Relevant Data Sources
• Business Processes
• Numerical and Textual
Data
Data
Preprocessing
Preprocessing
• SQL to summarize/aggregate data
• Pre-computed fields
Data
Data
Reduction
Reduction
Data
Data Mining
Mining
Task
Task Selection
Selection
Algorithm
Algorithm
Selection
Selection
Data
Data Mining
Mining
• Find useful variables
• Classification: Identify clusters that describe behaviors
• Association: What variables describe the problem?
• Statistical methods, decision
trees, Bayesian nets, …
• Search for patterns
• Discover knowledge
Data Mining Expertise can be utilized for:
• Product and Architecture Decisions
• Post Launch Product Improvement
• Expand revenue opportunities
• Post Sale Services Improvement
• Customer Relationship Management
Interpretation
Interpretation
of
of Results
Results
• Explain mined patterns
• Quantify correlations, create rules
• Triage with SMEs
Deploy
Knowledge
• Tools & documentation
• Reports and proposals for Business Decisions and Implementation
• Rollout & Feedback – Quantify Benefits
Oracle OpenWorld: October 2006
16
Rewards of Data Mining
The Pleasure of Finding New Things that Matter
Corporate Financial Benefits
Personal Financial Benefits
Oracle OpenWorld: October 2006
17