April 11, 2008
Data Mining Competition 2008
The 4th Annual Business Intelligence Symposium
Hualin Wang ([email protected])
Manager of Advanced Analytics
Retail Marketing Insights
Alliance Data, Columbus, Ohio
About Alliance Data
Alliance Data develops data-driven solutions that
help partners build lasting relationships with
their customers. As one of the largest providers
of retail and co-brand card services, loyalty and
marketing solutions, payment processing, and
business process outsourcing, we serve the
retail, petroleum, utility, financial services and
hospitality markets.
Approach Summary
Exploratory Data Analysis
• Identify data issues
• Re-code variables
• Transform variables
• Frequency, UNIVARIATE, BIVARIATE, ANOVA analysis, etc.
Modeling Methodology
• LOGISTIC & PROBIT regression models
• Develop a set of regression models of both types on bootstrapping
samples with a range of weights for responders and non-responders.
Ensemble Models
• Ensemble the set of LOGISTIC & PROBIT models
Exploratory Data Analysis
Missing Imputation – Substitute missing values with the mean, median,
mode, ‘logical’ values, or other values suggested by the bivariate results
(sketched at the end of this slide).
Notes: Twenty variables are formatted differently in the training and test datasets. For example,
some variables have the value ‘YE’ in one dataset and ‘YES’ in the other, and X2 has the value
HILLSBOROUGH in one set and HILLSBOROUG in the other.
Univariate / Bivariate – Check distributions, extreme values, trends
and other patterns.
Significance Investigation – Conduct contingency table analysis to
understand whether character variables and their levels are significant in
predicting response.
Information Value – Compute information values.
Clustering Analysis – Reveal correlation among numerical variables.
Play the MUSIC gracefully or face it!
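A minimal sketch of the imputation and label-harmonization steps in Python, under stated assumptions: train and test are pandas DataFrames, the column lists and the fix-up map are placeholders, and the ‘logical’ fill values chosen from the bivariate results are omitted.

import pandas as pd

# Illustrative fix-ups for categories spelled differently in the two files
# (e.g. 'YE' vs 'YES', 'HILLSBOROUG' vs 'HILLSBOROUGH').
LABEL_FIXES = {"YE": "YES", "HILLSBOROUG": "HILLSBOROUGH"}

def harmonize_labels(df):
    """Align category spellings that differ between the training and test sets."""
    return df.replace(LABEL_FIXES)

def impute_missing(df, numeric_cols, char_cols):
    """Fill numeric columns with the median and character columns with the mode;
    'logical' fills based on the bivariate results would override these defaults."""
    df = df.copy()
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())
    for col in char_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

# Usage (column lists are placeholders):
# train = impute_missing(harmonize_labels(train), ["X1"], ["X2"])
# test  = impute_missing(harmonize_labels(test),  ["X1"], ["X2"])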
Variable Creation
Capping –
Extreme tails are typically capped to reduce their undue
influence and to produce more robust parameter estimates.
Binning –
Small and insignificant levels of character variables are
regrouped.
Box-Cox Transformations – These transformations are commonly
included, especially the square root and logarithm.
Johnson Transformations – Performed on numeric variables to
make them more ‘normal’.
Weight of Evidence – Created for character variables and binned
numeric variables (see the sketch after this list).
Interaction – Explore possible interactions with the help of decision
tree analyses.
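A short sketch of the capping and weight-of-evidence steps in Python, assuming a pandas DataFrame with a binary 0/1 response column; the percentile cut-offs and the ten quantile bins are illustrative, not the values used in the competition.

import numpy as np
import pandas as pd

def cap(series, lower_pct=0.01, upper_pct=0.99):
    """Cap extreme tails at chosen percentiles to reduce their undue influence."""
    lo, hi = series.quantile([lower_pct, upper_pct])
    return series.clip(lower=lo, upper=hi)

def weight_of_evidence(df, var, target, n_bins=10):
    """Row-level WoE encoding of a binned numeric variable: each observation gets
    log(% responders / % non-responders) for its bin. For character variables,
    the levels themselves play the role of the bins."""
    bins = pd.qcut(df[var], q=n_bins, duplicates="drop")
    counts = pd.crosstab(bins, df[target])   # columns: 0 = non-responder, 1 = responder
    dist = counts / counts.sum()             # share of each class falling in each bin
    woe = np.log((dist[1] + 1e-6) / (dist[0] + 1e-6))
    return bins.map(woe).astype(float)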
Modeling Methodology
Step 1 – Pick an integer from 3 through 16 and draw 10
bootstrapping samples.
Step 2 – Develop a LOGISTIC model on each sample with
responders’ weight equal to the integer and non-responders’
weight equal to 1.
Step 3 – Average the 10 probabilities to produce an ensemble
LOGISTIC model. In this way, we create 14 ensemble
LOGISTIC models, one for each integer from 3 through 16.
Steps 4-6 – Similarly, we obtain 14 ensemble PROBIT models.
Together there are 28 models; a sketch of the full procedure follows.
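The original models were built with the LOGISTIC and PROBIT procedures in SAS; the sketch below approximates the same steps in Python with statsmodels binomial GLMs (logit and probit links), with train, test, features and the response column name as placeholders.

import numpy as np
import statsmodels.api as sm

def ensemble_scores(train, test, features, target, resp_weight, link, n_boot=10, seed=0):
    """Fit n_boot binomial GLMs on bootstrapping samples of train, giving responders
    a frequency weight of resp_weight and non-responders a weight of 1, then average
    the predicted probabilities on test."""
    rng = np.random.default_rng(seed)
    X_test = sm.add_constant(test[features], has_constant="add")
    preds = []
    for _ in range(n_boot):
        boot = train.sample(n=len(train), replace=True,
                            random_state=int(rng.integers(2**31 - 1)))
        X = sm.add_constant(boot[features], has_constant="add")
        y = boot[target]
        w = np.where(y == 1, resp_weight, 1.0)   # responder vs non-responder weights
        fit = sm.GLM(y, X, family=sm.families.Binomial(link=link),
                     freq_weights=w).fit()
        preds.append(fit.predict(X_test))
    return np.mean(preds, axis=0)

# 14 logit + 14 probit ensembles, one per integer weight from 3 through 16 (28 in total):
# links = {"LOGISTIC": sm.families.links.Logit(), "PROBIT": sm.families.links.Probit()}
# scores = {(name, w): ensemble_scores(train, test, features, "response", w, link)
#           for name, link in links.items() for w in range(3, 17)}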
Ensemble Models
Use each of the 28 models to rank order the
95,960 observations in the test dataset from
95,960 to 1 based on descending predicted
probabilities.
The average of the 28 ranks for each observation
is the final score (see the sketch below).
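A minimal sketch of the rank averaging, assuming the 28 ensemble probabilities are collected as columns of a pandas DataFrame; each column is ranked so that the highest probability receives the top rank (95,960 here) and the lowest receives 1, matching the convention above.

import pandas as pd

def final_scores(prob_frame):
    """prob_frame: one column per model (28 here) of predicted probabilities and
    one row per test observation. Rank each column from 1 (lowest probability)
    to N (highest probability) and average the ranks across models per row."""
    ranks = prob_frame.rank(method="first", ascending=True)
    return ranks.mean(axis=1)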
What was considered throughout the process?
The two judgment criteria: the c-statistic
& the response rate in the top 10K.
The response rate in the top 10K
requires a model to push responders
to the top as much as possible; its
rank-order capability in the middle may
not need to be strong. The c-statistic
criterion requires a model to rank order
the whole population. See the chart below
and the sketch of both criteria that follows it.
[Chart: Comparing Alternative Models – cumulative responders by model decile (0 to 10) for Model I vs. Model II]
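A sketch of the two criteria in Python, assuming a labeled dataset is available (e.g. a validation split of the training data, since test labels are not available to competitors); the c-statistic is computed as the ROC AUC of the binary response, and top_k defaults to the 10K cut-off.

import numpy as np
from sklearn.metrics import roc_auc_score

def judge(y_true, score, top_k=10_000):
    """Evaluate a score vector on the two criteria: the c-statistic (ROC AUC for a
    binary response) and the response rate among the top_k highest-scored records."""
    y_true = np.asarray(y_true)
    score = np.asarray(score)
    c_stat = roc_auc_score(y_true, score)
    top_idx = np.argsort(score)[::-1][:top_k]   # indices of the top_k scores
    top_rate = y_true[top_idx].mean()           # share of responders in the top_k
    return c_stat, top_rate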
Modeling methods: There are a few options for modeling the response, such as
LOGISTIC models, PROBIT models (or any other member of the family), decision trees,
SVM, TreeNet, and neural networks. I decided to use the one that I had used
before and knew to work well in a similar situation. A slight difference is
that this time I combined both LOGISTIC and PROBIT models instead of
choosing one over the other.
My Experiences
• Play the MUSIC gracefully or face it! It usually
pays off to develop disciplined procedures to
discover and deal with data issues.
• Develop models with different methods and
then combine them. In general, ensemble
models outperform models with any single
method.
• Spend a good amount of time trying to
discover trends, patterns and other true data
relationships, and make good use of them in
modeling.
Thanks!
Many thanks:
To the Data Mining Program at the University of Central Florida and
BlueCross BlueShield of Florida for organizing and sponsoring the competition.
Especially to Professor Su for his analytical work and timely responses
to our inquiries.
To the 4th Annual Business Intelligence Symposium for providing the
opportunity for us to present and discuss the problem and the competition.