Drew Minkin
[email protected]
◦ Past
 Analytics Architect at Zilliant
 Senior Consultant, Fujitsu
 6+ years Microsoft Services
 Escalation Engineer
 Dedicated Field Engineer (“Alliance”)
 Local speaker for SQL and BI
 OLAP Lecturer, SMU’s BI Graduate Certificate Program
◦ Present
 Business Intelligence Architect at FiServ ISV
 Part time data miner for hire





Data Mining Intro
DM Methodology
Data Concepts
Validating and Testing Models
Applying Output with Scorecards


http://archive.ics.uci.edu/ml/
http://www.kdnuggets.com/




Methodology
Architecture
Information Flow
Technologies






Problem Definition
Data Modeling
Data Discovery
Analytics Modeling
Applied Analytics
Model Validation







Business case and non-technical details of predictive analytics inquiry
◦ Business objectives and success criteria
◦ Requirements, assumptions and constraints
◦ Project plan, risks and contingencies
◦ Data mining goals and success criteria
◦ Terminology, tools and techniques

Analysis of source data for structural and content gaps
◦ Data collection report
◦ Data description report
◦ Data exploration report
◦ Data quality report

Selection and manipulation of source data into a conformed entity input ready for formal exploration
◦ Dataset, dictionary and rationale
◦ Data cleansing report
◦ Derived attributes
◦ Generated, merged and reformatted data

Research and analysis of patterns and creation of data mining models
◦ Model
◦ Modeling technique
◦ Modeling assumptions
◦ Test design
◦ Parameter settings
◦ Model description

Testing data mining models using different algorithms and validation of statistical significance
◦ Revised parameter settings
◦ Model validation plan
◦ Model assessment

Integration of models with new data
◦ Deployment plan
◦ Monitoring and maintenance plan
◦ Final report
◦ Final presentation
◦ Experience documentation

Case – set of columns you want to analyze
◦ Age, Gender, Region, Annual Spending


Case Key – unique ID of a case
A column has:
◦ Data Type
◦ Content Type
◦ And optionally:
 Distribution
 Discretization
 Related Columns
 Flags (e.g. NOT NULL)


We don’t care about detailed low-level types
DM only uses:
◦ Text
◦ Long
◦ Boolean
◦ Double
◦ Date
◦ and by some 3rd party algorithms:
 Time and Sequence

Common:
◦ DISCRETE
 Red, Blue
◦ CONTINUOUS
 $6,511.49
◦ DISCRETIZED
 1-5, 6-20, 21+

Denotes a key:
◦ KEY

For special purposes:
◦ KEY SEQUENCE
◦ KEY TIME
◦ ORDERED
◦ CYCLICAL


Some algorithms interpret this in different
ways, but in general, columns are for:
Input
◦ For predicting another column

PREDICT
◦ These columns are both predicted and act as inputs
for predicting others

PREDICT_ONLY
◦ Not used as input

Columns can be input or predictable or both
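
The column attributes described above (data type, content type, usage) can be sketched as a small data structure. This is only an illustration, not an actual DM API: the `Column` class, the `usage` field, and the column names `CustomerId` and `AnnualSpending` are hypothetical, built from the Age, Gender, Region, Annual Spending example.

```python
from dataclasses import dataclass
from typing import Optional

# Values taken from the slides; the class itself is a hypothetical illustration.
DATA_TYPES = {"Text", "Long", "Boolean", "Double", "Date"}
CONTENT_TYPES = {"KEY", "DISCRETE", "CONTINUOUS", "DISCRETIZED",
                 "KEY SEQUENCE", "KEY TIME", "ORDERED", "CYCLICAL"}
USAGES = {"INPUT", "PREDICT", "PREDICT_ONLY"}


@dataclass
class Column:
    name: str
    data_type: str
    content_type: str
    usage: Optional[str] = None  # left as None for the case key (an assumption)

    def __post_init__(self):
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type}")
        if self.content_type not in CONTENT_TYPES:
            raise ValueError(f"unknown content type: {self.content_type}")
        if self.usage is not None and self.usage not in USAGES:
            raise ValueError(f"unknown usage: {self.usage}")

    @property
    def is_input(self):
        # PREDICT columns are predicted and also act as inputs.
        return self.usage in ("INPUT", "PREDICT")

    @property
    def is_predictable(self):
        # PREDICT_ONLY columns are predicted but never used as input.
        return self.usage in ("PREDICT", "PREDICT_ONLY")


# A made-up case definition for the Age/Gender/Region/Annual Spending example.
columns = [
    Column("CustomerId", "Long", "KEY"),
    Column("Age", "Long", "CONTINUOUS", "INPUT"),
    Column("Gender", "Text", "DISCRETE", "INPUT"),
    Column("Region", "Text", "DISCRETE", "INPUT"),
    Column("AnnualSpending", "Double", "CONTINUOUS", "PREDICT"),
]

inputs = [c.name for c in columns if c.is_input]
predictable = [c.name for c in columns if c.is_predictable]
```

Here `AnnualSpending` appears in both lists because PREDICT columns serve both roles, while the key column appears in neither.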


When you don’t need to analyze the full continuous range
DM automatically converts data into buckets
◦ By default, into 5

Techniques:
◦ AUTOMATIC
◦ CLUSTERS
◦ EQUAL_AREAS
◦ THRESHOLDS
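
As a rough illustration of bucketing, here is a minimal sketch of equal-population discretization, which is the idea behind EQUAL_AREAS: each bucket receives roughly the same number of values. The function names and the sample data are hypothetical, and the engine's own implementation may differ in how it picks boundaries.

```python
from bisect import bisect_right
from collections import Counter


def equal_area_cutpoints(values, buckets=5):
    """Return buckets-1 cut points so each bucket holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // buckets] for i in range(1, buckets)]


def bucket_index(value, cutpoints):
    """Map a value to its 0-based bucket; a value equal to a cut point goes to the upper bucket."""
    return bisect_right(cutpoints, value)


# Made-up sample: 100 evenly spread values, split into the default 5 buckets.
values = list(range(1, 101))
cutpoints = equal_area_cutpoints(values)
bucket_sizes = Counter(bucket_index(v, cutpoints) for v in values)
```

On skewed data the cut points crowd together where the values are dense, which is what separates this approach from fixed-width ranges.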

If you know the distribution of your data (you
should), indicate it:
◦ NORMAL
 Typical Gaussian bell-curve
◦ LOG NORMAL
 Most values at the “beginning” of the scale
◦ UNIFORM
 Flat line – equally likely or perfectly random

Other distributions can exist, but you cannot indicate them; the algorithm will still work fine
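
One simple way to check which distribution flag fits a column is to compare the skewness of the raw values with the skewness of their logarithms. The sketch below is an assumption about how you might do this outside the tool, using synthetic seeded data: a column that is strongly right-skewed but roughly symmetric after a log transform is a candidate for LOG NORMAL.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Synthetic, made-up data: a seeded log-normal sample standing in for a spending column.
rng = random.Random(42)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

raw_skew = skewness(sample)                         # strongly positive: long right tail
log_skew = skewness([math.log(v) for v in sample])  # near zero: symmetric after the log
```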

Nested Case – case containing a table column
◦ Purchases of a Customer

Used for analyzing patterns in a relationship

It has a Nested Key
◦ Not a “relational” foreign key!
◦ Normally, the Nested Key is a column you want to
analyze
 E.g.: Product Name or Model
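
A nested case can be pictured as a record whose `Purchases` column is itself a small table, keyed by the thing being analyzed. The sketch below uses made-up customers and the hypothetical field names `CustomerId`, `Purchases` and `ProductName`; it counts which products occur together within a case, the kind of pattern a nested table exposes.

```python
from collections import Counter
from itertools import combinations

# Made-up cases: CustomerId is the case key, ProductName is the nested key.
cases = [
    {"CustomerId": 1, "Region": "North",
     "Purchases": [{"ProductName": "Bike"}, {"ProductName": "Helmet"}]},
    {"CustomerId": 2, "Region": "South",
     "Purchases": [{"ProductName": "Bike"}, {"ProductName": "Helmet"},
                   {"ProductName": "Lock"}]},
    {"CustomerId": 3, "Region": "North",
     "Purchases": [{"ProductName": "Lock"}]},
]


def co_purchase_counts(cases):
    """Count how many cases contain each pair of nested keys."""
    counts = Counter()
    for case in cases:
        # A set removes repeat purchases; sorting gives each pair one canonical order.
        products = sorted({row["ProductName"] for row in case["Purchases"]})
        counts.update(combinations(products, 2))
    return counts


pair_counts = co_purchase_counts(cases)
```

Note that the nested key (`ProductName`) identifies a row within one customer's purchases; it is not a foreign key back to the customer.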
Algorithms and Use Cases
Association Rules
Clustering
Decision Trees
Linear Regression
Logistic Regression
Naïve Bayes
Neural Nets
Sequence Clustering
Time Series
Algorithms and Use Cases
Algorithm            Drillthrough  PMML  DM Dimension
Association          Yes           No    Yes
Clustering           Yes           Yes   Yes
Decision Trees       Yes           Yes   Yes
Linear Regression    Yes           No    No
Logistic Regression  No            No    No
Naive Bayes          Yes           Yes   No
Neural Network       No            No    No
Sequence Clustering  Yes           No    Yes
Time Series          Yes           No    No









Field     Description
AVGGIFT   Average dollar amount of gifts to date
INCOME    Household income
LASTGIFT  Last donation amount
MAXRAMNT  Dollar amount of largest gift to date
MINRAMNT  Dollar amount of smallest gift to date
RAMNTALL  Dollar amount of lifetime gifts to date
WEALTH1   Wealth Rating
WEALTH2   Wealth Rating
STATE     State abbreviation (a nominal/symbolic field)



Donor Rank
DOMAIN/Cluster code. A nominal or symbolic field that could be broken down by bytes as explained below.
◦ 1st byte = Urbanicity level of the donor's neighborhood
 U = Urban
 C = City
 S = Suburban
 T = Town
 R = Rural
◦ 2nd byte = Socio-economic status (SES) of the neighborhood
 1 = Highest SES
 2 = Average SES
 3 = Lowest SES
 Except for Urban communities, where 1 = Highest SES, 2 = Above average SES, 3 = Below average SES, 4 = Lowest SES
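
The DOMAIN layout described above can be split into two derived attributes with a few lines of code. This is a minimal sketch: the function name `decode_domain` and the choice to raise an error on unknown codes are assumptions, and real data may contain blanks or other values that need separate handling.

```python
URBANICITY = {"U": "Urban", "C": "City", "S": "Suburban", "T": "Town", "R": "Rural"}

# Non-urban neighborhoods use a three-level SES scale.
SES_DEFAULT = {"1": "Highest", "2": "Average", "3": "Lowest"}
# Urban neighborhoods use a four-level SES scale.
SES_URBAN = {"1": "Highest", "2": "Above average", "3": "Below average", "4": "Lowest"}


def decode_domain(code):
    """Split a two-character DOMAIN code into (urbanicity, SES) labels."""
    if len(code) != 2:
        raise ValueError(f"expected a 2-character code, got {code!r}")
    urbanicity, ses = code[0], code[1]
    if urbanicity not in URBANICITY:
        raise ValueError(f"unknown urbanicity byte: {urbanicity!r}")
    scale = SES_URBAN if urbanicity == "U" else SES_DEFAULT
    if ses not in scale:
        raise ValueError(f"unknown SES byte {ses!r} for urbanicity {urbanicity!r}")
    return URBANICITY[urbanicity], scale[ses]
```

For example, `decode_domain("U2")` gives `("Urban", "Above average")`, while `decode_domain("T2")` gives `("Town", "Average")`, because the same second byte is read on a different scale for Urban communities.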





www.crisp-dm.org
www.sqlserverdatamining.com
Masao Okada
Rafal Lukawiecki
Eugene A. Asahara
Data Mining in Action: A Case Study
Drew Minkin (madmanminkin)
Evaluation Links