RStat
Predictive Modeling & Scoring Applications
for Operational BI
Kathy Kendall
Strategic Product Manager
Agenda
ƒ What is Predictive Modeling?
ƒ What are Scoring Applications?
ƒ What is RStat?
ƒ DEMO:
Building a Scoring Application
ƒ Scoring Applications Project Life Cycle
Predictive Modeling
1. Examines large volumes of historical data,
2. evaluates them statistically,
3. identifies mathematical formulas or sets of rules,
4. which can be applied to new data
5. to predict an outcome (to score).
New data can be scored through:
ƒ Scoring Applications
ƒ In-Database Scoring
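The modeling steps above can be illustrated with a small sketch. This is not how RStat builds models; it only shows the idea of deriving a formula from historical data and then applying it to new data. The data values and the names `fit_line` and `score` are assumptions made up for this example.

```python
# Hypothetical historical data: (store visits, purchases) per customer.
history = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.0)]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def score(model, x):
    """Apply the fitted formula to one new record."""
    intercept, slope = model
    return intercept + slope * x


# Steps 1-3: examine the history and identify a formula.
model = fit_line(history)

# Steps 4-5: apply the formula to new data to predict an outcome.
new_data = [5, 6]
predictions = [score(model, x) for x in new_data]
```

A scoring application would wrap the `score` step in a UI; in-database scoring would run the same formula inside the database.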
Copyright 2007, Information Builders. Slide 3
Scoring Applications
Integrate model formulas / rules into web applications that
ƒ provide a UI to identify new data to be scored.
ƒ return the predicted value in a report, a graph, a map, a dashboard, or a process flow.
In-Database Scoring
PMML (Predictive Model Markup Language)
ƒ XML standard for expressing statistical & data mining models.
ƒ In-Database scoring translates PMML into SQL scripts that score directly in the database.
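As a rough sketch of the in-database idea, a model's formula can be rendered as a SQL expression and evaluated by the database itself. The example below does not parse PMML and is not RStat's translator: a hypothetical dictionary of coefficients stands in for a regression model that would have been read from a PMML file, and SQLite stands in for the production database. The table, columns, and the function name `model_to_sql` are assumptions.

```python
import sqlite3

# Hypothetical coefficients standing in for a regression model parsed from PMML.
model = {"intercept": 0.5, "coefficients": {"age": 0.02, "visits": 0.3}}


def model_to_sql(model, table):
    """Render a linear model as a SELECT that scores every row of `table`.

    Column and table names are trusted here; real code would validate them.
    """
    terms = [repr(float(model["intercept"]))]
    for column, weight in model["coefficients"].items():
        terms.append(f"{float(weight)!r} * {column}")
    return f"SELECT id, {' + '.join(terms)} AS score FROM {table} ORDER BY id"


# Made-up customer table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, age REAL, visits REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 30, 2), (2, 50, 0)],
)

# The database computes the scores; no rows are pulled out for scoring.
sql = model_to_sql(model, "customers")
scores = conn.execute(sql).fetchall()
conn.close()
```

The point of the design is that the data never leaves the database: only the generated SQL travels to it.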
What is RStat?
RStat is the first fully integrated environment for creating BI, modeling, and
scoring applications. It offers:
ƒ Low Cost: Built on the open source R engine, RStat eliminates all statistical
software licensing costs. Organizations pay only maintenance and support.
ƒ Better License Management: The full integration within Developer Studio
allows organizations to scale down the number of other statistical software
licenses that are used primarily for query and analysis, i.e. for pure BI functions.
ƒ One Tool For All Users: Having a single BI and modeling tool allows
organizations to better maintain, manage, and share resources across BI and
statistical projects.
ƒ Top 10 Data Mining Algorithms: RStat includes the most commonly used
statistical and data mining algorithms plus extensive model evaluation tools
that will satisfy 90% of your enterprise data mining requirements.
ƒ Simple User Interface: RStat’s simple and intuitive user interface allows
organizations to deploy it to more analysts with less training compared to other
statistical packages.
ƒ Deployment On Any Platform: The unique WF scoring routines can be
deployed on all WF supported platforms giving organizations independence of
expensive statistical servers, thus eliminating any additional software and
maintenance costs.
ƒ Reputation & Extensibility: R is used by over 1MM analysts worldwide, is
taught in many universities, and has over 1000 packaged extensions for many
different types of analysis giving your organization instant access to more
models and techniques than any other statistical software.
Comprehensive List of Data Mining Models
ƒ Supervised modeling for classification and prediction: A target
(dependent) variable and a training set are required to build the model.
  ƒ Regression – Linear, GLM, Logistic, Poisson and Multinomial
  ƒ Decision Trees
  ƒ Random Forests (the algorithm name is the same)
  ƒ Boosting (AdaBoost algorithm)
  ƒ Support Vector Machines
  ƒ Neural Networks (Feed-Forward Neural Network model)
ƒ Unsupervised modeling for classification only: Classification is
generated directly on the original data, without building a model first,
i.e., dependent variables and training data are not required.
  ƒ Clustering – both K-means and Hierarchical clustering
  ƒ Association Rules
ƒ Hypothesis Testing: T-tests, variance tests.
ƒ Descriptive statistics: Summary statistics, distributions, correlations,
principal components
ƒ Model evaluation: Confusion table, Risk chart, Lift chart, ROC Curve,
Precision and Sensitivity charts
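To make the model evaluation terms concrete, here is a minimal sketch of a confusion table for a binary target, with precision and sensitivity derived from it. It does not reproduce RStat's evaluation output; the outcome lists and the function name `confusion_table` are made up for illustration.

```python
# Hypothetical actual outcomes and model predictions (1 = purchase, 0 = no purchase).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]


def confusion_table(actual, predicted):
    """Count true/false positives and negatives for a binary target."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == 0 else "fn"] += 1
    return counts


table = confusion_table(actual, predicted)

# Precision: share of predicted purchases that really were purchases.
precision = table["tp"] / (table["tp"] + table["fp"])

# Sensitivity: share of real purchases that the model caught.
sensitivity = table["tp"] / (table["tp"] + table["fn"])
```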
RStat: Scoring Application for Marketing
Using historical & demographic data to predict future purchases
Building a Model In RStat
Model Output
Deploying a Model
Building a Scoring Application
Scoring Applications – Adhoc Analysis
Scoring Applications Project Life Cycle: 3 to 9 Months
Business Requirements (15%): Project Manager
ƒ Business Objectives
ƒ Background Information
ƒ Prior History
ƒ Resources
ƒ Modeling Objectives
ƒ Assumptions
ƒ Constraints
ƒ Risks
ƒ Contingency
ƒ Terminology
ƒ Tools
ƒ Techniques
ƒ Criteria for Success
ƒ Project Plan

Data Assessment (20%): BI Developer 80%, Statistician 20%
ƒ Data Extraction
ƒ Data Definitions
ƒ Data Exploration
ƒ Data Quality Verification
ƒ Data Assessment Report

Data Preparation (25%): BI Developer 80%, Statistician 20%
ƒ Data Fields Selection
ƒ Data Cleansing
ƒ Data Construction – derived attributes, generated records
ƒ Missing Values Treatment
ƒ Integrated Data – Merge and Overlay
ƒ Transform & Format Data
ƒ Final Data Set Report

Modeling (15%): Statistician
ƒ Modeling Technique Selection
ƒ Document Assumptions for Modeling
ƒ Generate Test Design
ƒ Build Model
ƒ Assess Model

Model Evaluation (10%)
ƒ Evaluate model against criteria for success and business goals
ƒ Model Documentation
ƒ Model Approval

Deployment (15%)
ƒ Application UI Design
ƒ Application Workflow Design
ƒ Platform considerations
ƒ Development
ƒ Testing
ƒ Final Deployment

Roles for Model Evaluation and Deployment: Statistician, Project Manager, BI Developer
Sources: CRISP-DM (Cross Industry Standard Process for Data Mining)
TDWI Best Practices in Predictive Analytics, Q1 2007.
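The phase percentages on the life cycle slide can be turned into rough durations. The sketch below assumes the percentages describe each phase's share of elapsed project time and that phases run one after another; the slide does not say either, so treat the output as a planning aid only. The function name `phase_months` is hypothetical.

```python
# Phase shares in percent, as listed on the life cycle slide.
phase_share = {
    "Business Requirements": 15,
    "Data Assessment": 20,
    "Data Preparation": 25,
    "Modeling": 15,
    "Model Evaluation": 10,
    "Deployment": 15,
}


def phase_months(total_months):
    """Split a project length across phases in proportion to their shares."""
    return {
        phase: total_months * pct / 100
        for phase, pct in phase_share.items()
    }


# The slide gives a range of 3 to 9 months for the whole project.
shortest = phase_months(3)
longest = phase_months(9)
```

Under these assumptions, Data Preparation alone takes between 0.75 and 2.25 months, more than any other phase.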
Thank you!