KXEN Robust Regression (K2R) Exercises

Data Mining III – Practical Exercises
Nicolas Dulian – [email protected]
David Serre – [email protected]
Master MI2 Pro EID - Université Paris 13
FDON
January – February 2008

An "On-line Bookstore" launched a marketing campaign to promote a book on the Art History of Florence. The campaign resulted in a response rate of 8.8%. Two months later, the company decides to assess what the impact of using data mining on the campaign would have been.
The database contains 50,000 customer names, split into two files:
■ responses.txt, which contains 25,365 records
■ prospects.txt, which contains 24,635 records
responses.txt contains the records used for the starter of the campaign. This data set is the one used to create a model. The model is then applied to prospects.txt.
Two columns can be used as target variables: PURCHASE OF ART HISTORY OF FLORENCE as a binary target and AMOUNT as a continuous one.
Step 1 : Building a model
Build a first model explaining and predicting the purchase of the Art History of Florence book. Select a random selection as the cutting strategy. Do not forget that the account number is the primary key (Storage: String, Type: Nominal, Key = 1).
Did you exclude variables from the modeling process? Why?
Is the model safe enough to be used?
Save the model.

Step 2 : Analyzing Results
Analyze the Variables Contributions and outline the profile of a potential buyer, considering the top 5 variables.
Analyze the Profit Curve.
Let us assume that a mailing costs 5 euros and the Art History of Florence book 35 euros. If the prospect decides to buy the book, the online bookstore earns 30 euros back.
Find the best percentage of the population to target.
Calculate the maximum profit to be made thanks to the Customized Curve.
Use the Cost Matrix and compare with the results obtained from the Customized Curve.
Reduce the number of variables in the model to 7 and save the resulting model. Compare and discuss.
Rebuild a model of 7 variables, but using the "Auto-selection" advanced option. Save the "Intermediate states". Compare with the previous models.
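The profit reasoning of Step 2 can be reproduced outside KXEN. The sketch below (hypothetical scores and outcomes, not the exercise data) ranks customers by model score and finds the targeting percentage that maximizes profit under the 5-euro mailing cost and 30-euro net gain stated above:

```python
# Sketch of the profit-curve logic from Step 2 (not KXEN's implementation):
# mailing a customer costs 5 EUR; a purchase brings 30 EUR back.
# Customers are ranked by model score; we compute the profit obtained when
# targeting the top k customers, for every k, and keep the best cut-off.

MAIL_COST = 5.0   # cost of one mailing, in euros
NET_GAIN = 30.0   # net gain when a mailed prospect buys the book

def best_targeting(scores, responded):
    """scores: model scores; responded: 1 if the customer bought, else 0.
    Returns (best_fraction_of_population, maximum_profit)."""
    ranked = sorted(zip(scores, responded), key=lambda p: -p[0])
    profit, best_profit, best_k = 0.0, 0.0, 0
    for k, (_, r) in enumerate(ranked, start=1):
        profit += NET_GAIN if r else -MAIL_COST
        if profit > best_profit:
            best_profit, best_k = profit, k
    return best_k / len(ranked), best_profit

# Toy illustration with made-up scores and outcomes:
scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1,   1,   0,   1,   0,   0,   0,   0,   0,   0]
frac, profit = best_targeting(scores, responded)
```

Here the best strategy mails the top 40% of customers, which is the kind of cut-off the Profit Curve reads off graphically.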
Step 3 : Applying the Model
The shop now decides to contact 20% of the customers who have not received the mailing yet, in order to maximize the benefits of this operation. The number of variables kept is 7.
Apply the model directly on "prospects.txt" and output the results in a text file. Use the "probability" output option.
Apply the model directly on "Prospects target values unknown" and output the results in a text file. Choose the "decision" option when applying the model.
Apply the model directly on "Prospects target values unknown" and output the results in another text file. Choose the "probability" option when applying the model.
Thanks to this last output file, could you provide an estimate of the maximum number of respondents?
Generate the model in HTML export format, displaying the probabilities.

Step 4 : Discuss the benefits of data mining
Finally, in order to validate the scores and probabilities calculated by the model built in Step 3, a control for deviation is performed on the "prospects.txt" data set. This file contains the returns of the marketing campaign.
Provide an estimation of the maximum number of respondents and compare it with the real results. What do you conclude?
Control deviations on the target and on the explanatory variables. In particular:
■ Analyze the gain curve with the Apply-in curve.
■ List all variables and categories of variables likely to deviate.
■ Control the deviation of the target variable.
Should the model be rebuilt?
Update the Internal Statistics, discuss the figures and results.

Step 5 : Go further…
■ Go back to Step 1… but using the continuous target instead, the AMOUNT variable.
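The link between the "probability" and "decision" outputs of Step 3 can be illustrated with the campaign's cost figures: mailing is worthwhile whenever the expected gain exceeds the mailing cost. The threshold derivation below is our own sketch, not taken from KXEN:

```python
# Sketch (not KXEN's code): turning a "probability" output into a
# "decision" output using the campaign's cost figures.
# Expected profit of mailing a prospect with purchase probability p:
#   E = 30 * p - 5 * (1 - p)   -> mail whenever E > 0, i.e. p > 5/35.

THRESHOLD = 5 / 35  # about 0.143

def decision(prob):
    """Return 1 (mail the prospect) when the expected profit is positive."""
    return 1 if prob > THRESHOLD else 0

probs = [0.05, 0.10, 0.15, 0.50]
decisions = [decision(p) for p in probs]
```

Note how low the break-even probability is: with a 30-euro gain against a 5-euro cost, even modest purchase probabilities justify a mailing.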
KXEN Smart Segmenter (K2S) Exercises

The online bookstore company then decides to paint a broad picture of its Art History of Florence clients. To do that, the company builds a clustering to improve its understanding of these book buyers.

Step 1 : Untargeted Clustering
As a first step, the company decides not to use a target variable when building its clustering. The data set to build the model on is responses.txt.
Build a clustering of 5 clusters. Exclude the "Amount" variable from the modeling process. Do not check the SQL expressions. Is this a good model?
Build a clustering of 4, then 6 and 7 groups of individuals. Compare the KXEN indicators. Which model would you select?
Build the same clustering but now checking the "SQL expressions" box. Which are the best clusters of customers?

Step 2 : Targeted Clustering
As a second step, the company decides to orient the way its customers are spread within clusters by using the variable "Art history of Florence" as a target.
Build a clustering of 5 clusters. Exclude the "Amount" variable from the modeling process. Do not check the SQL expressions.
Analyze the model robustness and quality. Improve its overall robustness and quality.
Analyze the clusters and select the 2 most "relevant" ones.
Build a new clustering but add the SQL expressions in the modeling process. Compare the results between the models built with and without SQL expressions.
Which are the "best" clusters of customers? Describe them.
Which cluster of customers is the most likely to contain potential "Art history of Florence" book buyers?
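To make the clustering steps concrete, here is a minimal k-means sketch on made-up one-dimensional "Amount-like" data. It is only a conceptual stand-in and makes no assumption about the algorithm K2S actually uses:

```python
# Minimal k-means sketch to illustrate what an untargeted clustering does.
# The data and the choice of algorithm are ours, not KXEN's.

def kmeans_1d(values, k, iters=20):
    # Initialise centroids on evenly spaced positions of the sorted data.
    data = sorted(values)
    centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assign every value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[i].append(v)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

values = [1, 2, 2, 3, 20, 21, 22, 100, 101]
centroids, clusters = kmeans_1d(values, k=3)
```

On this toy data the three groups separate cleanly; the exercises above ask the analogous question for real customers, plus how robust those groups are.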
Step 3 : Compare K2S and K2R results
Discuss the differences between a segmentation and a classification. In which cases would you choose one or the other?

KXEN Time Series (KTS) Exercises

A large discount retailer needs to know the sales forecasts for the coming 12 months in order to anticipate its stock supply. The file RetailMonthlySales_97-04.csv contains monthly sales amounts from 1997 to 2004. The retailer will use KXEN Time Series to identify the trend and the cycles, and to perform forecasts.

Step 1 : Forecasts equal to 6
Run KTS with the default values and provide the forecasts for the next 6 periods.
Present and explain the components of the final model (Trend, Cycle, Fluctuation) and discuss its accuracy.
Generate the forecasts for the next 6 months and compute the confidence interval for each predicted value.
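The Trend / Cycle / Fluctuation split that KTS reports can be imitated with a classical decomposition: a least-squares trend, a monthly cycle of average detrended values, and the remaining fluctuation, whose spread gives a rough 95% interval. This is a conceptual sketch on synthetic data, not the KXEN algorithm:

```python
# Classical decomposition sketch (assumed, not KXEN's method):
# Trend = least-squares line; Cycle = mean detrended value per month of
# year; Fluctuation = what remains, whose spread sizes the interval.

def decompose_and_forecast(y, period=12, horizon=6):
    n = len(y)
    t = list(range(n))
    # Least-squares slope and intercept for the trend component.
    tm, ym = sum(t) / n, sum(y) / n
    slope = (sum((ti - tm) * (yi - ym) for ti, yi in zip(t, y))
             / sum((ti - tm) ** 2 for ti in t))
    intercept = ym - slope * tm
    trend = [intercept + slope * ti for ti in t]
    # Cycle: average detrended value for each position in the period.
    cycle = []
    for p in range(period):
        vals = [y[i] - trend[i] for i in range(p, n, period)]
        cycle.append(sum(vals) / len(vals))
    # Fluctuation and a rough 95% confidence half-width.
    resid = [y[i] - trend[i] - cycle[i % period] for i in range(n)]
    sd = (sum(r * r for r in resid) / n) ** 0.5
    return [(intercept + slope * (n + h) + cycle[(n + h) % period], 1.96 * sd)
            for h in range(horizon)]

# Synthetic monthly series: linear trend plus a December bump.
series = [100 + 2 * t + (30 if t % 12 == 11 else 0) for t in range(96)]
fc = decompose_and_forecast(series, period=12, horizon=6)
```

Each forecast comes back as a (value, half-width) pair, which is the shape of answer Step 1 asks for.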
Step 2 : Forecasts equal to 12
Re-run KTS with a number of forecasts equal to 12. What are the new model and its accuracy?
Generate the forecasts for the next 12 months and compute the confidence interval for each predicted value.
What is the value of the Maximum Horizon?

KXEN Association Rules (KAR) Exercises

The goal here is to figure out which transitions between web pages are most likely to lead a client to the purchasing act.
The data sets to be used for this scenario are the following:
■ References: website_references.csv. This is a comma-separated file containing the identifiers for each session. Each of these keys, named SessionID, represents one visit on the website.
■ Transactions: website_transactions.csv. This file reports each page accessed. Each transaction is defined by a session ID, the Web page accessed, the IP address and the date/time of the session.
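Behind any association-rules tool, a rule A → B is scored by its support (sessions containing both pages) and its confidence (support divided by the sessions containing A); KXEN additionally reports its own KI indicator. A minimal sketch of these two base measures on made-up sessions:

```python
# Sketch of support/confidence scoring for single-item page rules.
# Sessions and page names are made up; KAR's KI indicator is not
# reproduced here.

from collections import Counter
from itertools import permutations

def rules(sessions, min_support=2):
    """Return {(antecedent, consequent): (support, confidence)} for
    single-item rules seen in at least `min_support` sessions."""
    item_count, pair_count = Counter(), Counter()
    for pages in sessions:
        unique = set(pages)
        item_count.update(unique)
        pair_count.update((a, b) for a, b in permutations(unique, 2))
    return {(a, b): (s, s / item_count[a])
            for (a, b), s in pair_count.items() if s >= min_support}

sessions = [
    ["/home", "/shop", "PURCHASEOK"],
    ["/home", "/shop"],
    ["/home", "/shop", "PURCHASEOK"],
    ["/home"],
]
r = rules(sessions)
```

Raising the minimum-support and confidence thresholds, as the exercises ask, simply prunes this rule dictionary down to the reliable rules.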
Step 1 : Defining a relevant KAR data set
Run KAR with the default threshold values and without the Sequence mode. Are all the rules found useful and relevant?
Run KAR once again; in order to limit the number of rules generated, check the Sequence Mode box and let KXEN build all the possible rules with a minimum support of 50.
Then limit these rules to those whose KI is greater than 0.5.
Tune the threshold values to detect rules implying the "PURCHASEOK" items.

Step 2 : Select Relevant Rules
Re-run KAR with a minimum support of 5 and display all the rules whose antecedent is "/shop/bodyKidsstuff.tmpl". Check the Sequence mode box. Can you identify a rule implying PURCHASEOK?

KXEN Event Log (KEL) Exercises

A bank is launching a marketing campaign to promote a new product. Because of its mailing capacity, the bank decided to mail its customers at different dates. On top of this information, the bank decided to use the 'credit card usage' information.
To build the underlying model, the bank has at its disposal:
■ A file displaying "static" information for each customer involved in the "Starter" campaign (customer.txt), where the variables representing the purchase of the product are called mail_answer and mail_amount.
■ A "transactions" file, named CardUsage.txt, listing all the transactions made.
The bank decides to enrich its analytic dataset with the customers' credit card transactions made before the mailing was sent.
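The join the bank needs, quarterly aggregates of card transactions preceding each customer's own mailing date, can be sketched as below. The 90-day window and the data are simplifying assumptions for illustration; only the file and column names come from the exercise:

```python
# Sketch of a KEL-style temporal join: for each customer, count card
# transactions in each of the 4 quarters (90-day windows here, an
# assumption) preceding that customer's mailing date.

from datetime import date

def quarterly_counts(mail_dates, transactions, window=90, quarters=4):
    """mail_dates: {CustomerID: mailing date};
    transactions: list of (CustomerID, transaction date).
    Returns {CustomerID: [q1, q2, q3, q4]}, q1 being the most recent."""
    out = {cid: [0] * quarters for cid in mail_dates}
    for cid, tdate in transactions:
        mdate = mail_dates.get(cid)
        if mdate is None or tdate >= mdate:
            continue  # keep only transactions made before the mailing
        q = (mdate - tdate).days // window
        if q < quarters:
            out[cid][q] += 1
    return out

mail_dates = {"C1": date(2008, 1, 1)}
transactions = [("C1", date(2007, 12, 20)),  # 12 days before  -> q1
                ("C1", date(2007, 9, 1)),    # 122 days before -> q2
                ("C1", date(2006, 1, 1))]    # too old -> dropped
agg = quarterly_counts(mail_dates, transactions)
```

The key point, which KEL handles automatically, is that the aggregation window is anchored on each customer's individual mailing date, not on a single calendar date.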
Step 1: Building an Analytic Dataset and running a model
Use KEL to join both tables on CustomerID, building 4 quarterly aggregates up to the mailing dates.
Define and test as many new KEL variables as you want, in order to build a robust and good segmentation model. The results of the campaign are stored in the variables mail_answer and mail_amount. Exclude the continuous target "mail_amount" from the modeling process.

Step 2: Analyzing the Results
Analyze the robustness.
Analyze the variables contributions.
Analyze the profit curve.

Step 3: Impact of KEL
Rebuild the classification model without involving the transactional data and then compare the results. What do you conclude?

KXEN Sequence Coder (KSC) Exercises

An e-business company decided to rethink its web site in order to improve its overall profits. To do that, the company needs to understand which sequences of web pages are most likely to influence the purchase act. The company uses KXEN Sequence Coder in combination with KXEN Robust Regression to detect such sequences.
The data sets are:
■ A data file, named web_session, containing one day's worth of session logs, where the variable representing the purchase act is Purchase.
■ A data file detailing the customers' sessions, web_category_view.
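To see what a sequence coder contributes, a page sequence can be encoded into per-session statistics such as how often each category occurs (Count) and how often each ordered pair of consecutive categories occurs (CountTransition). This is a sketch of those two statistics on a made-up session, not KXEN's actual coder:

```python
# Sketch of Count / CountTransition encoding for one session's ordered
# list of viewed categories. Page names are made up.

from collections import Counter

def sequence_stats(pages):
    """pages: the ordered categories viewed in one session.
    Returns (category counts, consecutive-transition counts)."""
    count = Counter(pages)
    count_transition = Counter(zip(pages, pages[1:]))
    return count, count_transition

count, trans = sequence_stats(["home", "shop", "shop", "checkout"])
```

These counts become ordinary columns that a classifier such as K2R can consume, which is exactly the KSC + K2R combination used next.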
Step 1: Run KSC and K2R in combination
The join column is Session_Id for both data sets. We will calculate the intermediate sequences and keep only 40% of the information. The statistics to calculate for the category_Id variable are Count and CountTransition.
Run KSC followed by K2R and use PURCHASE as the target variable.

Step 2: Analyzing the Results
Analyze the results carefully… anything suspicious? Solve the issue and rebuild the model.

Step 3: Impact of KSC
Rebuild the classification model without involving the transactional data and then compare the results. What do you conclude?

KXEN Data Manipulation Exercises

The goal of this exercise is to let you build, step by step, an analytical data set.
A few prerequisites have to be met:
■ First make sure you have set up an ODBC connection to the "DataManip.mdb" MS Access database.
■ If your computer does not have an MS Access application, import the 4 text files as tables in another ODBC-compliant database.

Step 1: Select fields to build your analytical dataset
Select the individual_info table as the reference table to start.
Link all the tables with individual_info and select all the fields. Uncheck the external keys.
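The table linking of Step 1 amounts to a left join of each table onto the reference table on the individual identifier, keeping all fields but not duplicating the key. A plain-Python sketch with made-up rows:

```python
# Sketch of linking a table onto the reference table (individual_info)
# by a left join on the individual id. Table contents are made up.

def left_join(reference, other, key):
    """reference/other: lists of dicts; joins `other` onto `reference`,
    keeping every reference row even without a match."""
    index = {row[key]: row for row in other}
    joined = []
    for ref in reference:
        extra = index.get(ref[key], {})
        merged = {**ref, **{k: v for k, v in extra.items() if k != key}}
        joined.append(merged)
    return joined

individual_info = [{"id": 1, "age": 45}, {"id": 2, "age": 30}]
calls = [{"id": 1, "nb_calls": 7}]
rows = left_join(individual_info, calls, key="id")
```

Note that individual 2, who has no calls, is still kept: dropping unmatched individuals would silently bias the analytical data set.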
Step 2: Create new variables and filters
Now create your target variable. The target variable will be called "Class" and it will refer to the individual yearly income. This variable takes the value 1 if the individual earns more than $50,000 a year, 0 otherwise.
Create a new filter on the variable AGE, limiting your dataset to individuals older than X years. The value of X will be user-defined when the data manipulation is applied.

Step 3: Save and Use your Data Manipulation

Step 4: Enrich your data manipulation with behavioral data
Define a new data manipulation, based on your first one.
Choose the "AGGREGATES" tab. This tab allows adding aggregates similar to those automatically created by the KEL component. Build a new aggregate based on the COUNT of the number of calls (count on DAT); the transactions are stored in the calls table.
Save your changes into another data manipulation and repeat Step 3.
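The derived "Class" variable and the AGE filter of Step 2 are defined through the Data Manipulation interface in the exercise; their logic can be sketched in plain Python with made-up rows:

```python
# Sketch of Step 2's derived target and filter (not KXEN expressions):
# Class = 1 when yearly income exceeds 50,000 USD; the filter keeps only
# individuals older than a user-supplied X.

def add_class_and_filter(rows, min_age):
    """rows: list of dicts with 'income' and 'age' keys."""
    out = []
    for row in rows:
        if row["age"] > min_age:  # the AGE filter, X = min_age
            out.append({**row, "Class": 1 if row["income"] > 50_000 else 0})
    return out

rows = [{"age": 45, "income": 62_000},
        {"age": 30, "income": 40_000},
        {"age": 20, "income": 80_000}]
result = add_class_and_filter(rows, min_age=25)
```

The third individual is dropped by the filter before the target is ever computed, which mirrors the order of operations in the data manipulation.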