Data Mining III – Travaux Pratiques
Master MI2 Pro EID - Université Paris 13
Data mining et Business Intelligence (FDON), Janvier – Février 2008
Nicolas Dulian ([email protected])
David Serre ([email protected])

KXEN Robust Regression (K2R) Exercises

An on-line bookstore launched a marketing campaign to promote a book on the Art History of Florence. This campaign obtained a response rate of 8.8%. Two months later, the company decides to assess what the impact of using data mining on the campaign would have been.

The database contains 50,000 customer names, split into two files:
■ responses.txt, which contains 25,365 records
■ prospects.txt, which contains 24,635 records

responses.txt contains the records used for the starter of the campaign; this data set is the one used to create a model. The model will then be applied to prospects.txt. Two columns can be used as target variables: PURCHASE OF ART HISTORY OF FLORENCE as a binary target and AMOUNT as a continuous one.

Step 1 : Building a Model

Build a first model explaining and predicting the purchase of the Art History of Florence book. A random selection should be used as the cutting strategy. Do not forget that the account number is the primary key (Storage: String, Type: Nominal, Key = 1). Save the model.

Step 2 : Analyzing Results

Is the model safe enough to be used? Did you exclude variables from the modeling process? Why?

Analyze the variables' contributions and outline the profile of a potential buyer, considering the top 5 variables.

Analyze the profit curve. Assume that a mailing costs 5 euros and the Art History of Florence book sells for 35 euros, so the online bookstore earns 30 euros net on each prospect who buys. Find the best percentage of the population to target and calculate the maximum profit using the Customized Curve. Then use the Cost Matrix and compare with the results obtained from the Customized Curve.
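The profit computation behind Step 2 can be reproduced by hand. The sketch below ranks prospects by predicted purchase probability and accumulates profit using the costs given above (5 euros per mailing, 30 euros net per buyer). The scores are invented, and this is not KXEN's Customized Curve implementation:

```python
# A hand-rolled sketch of the Step 2 profit curve (not KXEN's computation).
# Prospects are ranked by predicted purchase probability; per the exercise,
# a mailing costs 5 EUR and each buyer returns 30 EUR net (35 - 5).
def profit_curve(scores, purchases, net_gain=30.0, mail_cost=5.0):
    """Return (fraction_targeted, cumulative_profit) pairs, best rank first."""
    ranked = sorted(zip(scores, purchases), key=lambda pair: -pair[0])
    n = len(ranked)
    profit, curve = 0.0, []
    for rank, (_, bought) in enumerate(ranked, start=1):
        # Each contacted prospect either buys (+30 net) or only costs the mailing (-5).
        profit += net_gain if bought else -mail_cost
        curve.append((rank / n, profit))
    return curve

# Toy data (hypothetical scores, not model output): two buyers among four prospects.
curve = profit_curve([0.9, 0.7, 0.4, 0.1], [1, 1, 0, 0])
best_fraction, best_profit = max(curve, key=lambda point: point[1])
# Mailing the top 50% yields the maximum profit of 60 EUR on this toy set.
```

The best percentage to target is simply the fraction at which this cumulative profit peaks; the Cost Matrix comparison in the exercise should lead to the same cut-off.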
Reduce the number of variables in the model to 7 and save the resulting model. Compare and discuss.

Rebuild a model with 7 variables, but using the "Auto-selection" advanced option and saving the "Intermediate states". Compare with the previous models.

Step 3 : Applying the Model

The shop now decides to contact 20% of the customers who have not yet received the mailing, in order to maximize the benefits of this operation. The number of variables kept is 7.

Apply the model directly to "Prospects target values unknown" and output the results in a text file, choosing the "decision" option when applying the model. Provide an estimation of the maximum number of respondents.

Apply the model directly to "Prospects target values unknown" and output the results in another text file, this time choosing the "probability" option. Thanks to this output file, could you provide an estimate of the maximum number of respondents?

Generate the model in HTML export format, displaying the probabilities.

Step 4 : Discuss the Benefits of Data Mining

Finally, in order to validate the scores and probabilities calculated by the model built in Step 3, a control for deviation is performed on the prospects.txt data set. This file contains the returns of the marketing campaign.

Apply the model directly to prospects.txt and output the results in a text file, using the "probability" output option. Compare the estimated number of respondents with the real results. What do you conclude?

Control the deviations on the target and on the explanatory variables. In particular:
■ Analyze the gain curve with the Apply-in curve.
■ List all variables and categories of variables likely to deviate.
■ Control the deviation of the target variable.

Should the model be rebuilt? Update the Internal Statistics, then discuss the figures and results.

Step 5 : Go Further…
■ Go back to Step 1, but using the continuous target instead: the AMOUNT variable.
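For the "probability" output used in Steps 3 and 4, a simple way to estimate the maximum number of respondents is to sum the predicted probabilities. A minimal sketch with invented scores:

```python
# A minimal sketch of the respondent estimate asked for in Step 3: with the
# "probability" output option, each scored prospect carries P(purchase), and
# the expected number of respondents is the sum of those probabilities.
def expected_respondents(probabilities):
    """Expected count of buyers among the scored prospects."""
    return sum(probabilities)

# Hypothetical probabilities for five prospects (not real model output).
probs = [0.30, 0.10, 0.05, 0.25, 0.10]
estimate = expected_respondents(probs)
# Comparing this estimate with the campaign's real returns is the deviation
# check performed in Step 4.
```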
08_FDON_3 copyright KXEN

KXEN Smart Segmenter (K2S) Exercises

The online bookstore company then decides to paint a broad picture of its Art History of Florence clients. To do that, the company builds a clustering model to improve its understanding of these book buyers. The data set used to build the model is responses.txt.

Step 1 : Untargeted Clustering

As a first step, the company decides not to use a target variable when building its clustering.

Build a clustering of 5 clusters. Exclude the "Amount" variable from the modeling process. Do not check the SQL expressions. Is this a good model? Analyze the model's robustness and quality, and improve them.

Build a new clustering, but add the SQL expressions to the modeling process. Compare the results between the models built with and without SQL expressions. Which are the "best" clusters of customers? Describe them.

Build clusterings of 4, then 6 and 7 groups of individuals. Compare the KXEN indicators. Which model would you select?

Step 2 : Targeted Clustering

As a second step, the company decides to orient the way its customers are spread across clusters by using the variable "Art history of Florence" as a target.

Build a clustering of 5 clusters. Exclude the "Amount" variable from the modeling process. Do not check the SQL expressions. Analyze the clusters and select the 2 most "relevant" ones.

Build the same clustering, but now checking the "SQL expressions" box. Which are the best clusters of customers? Which cluster of customers is the most likely to contain potential "Art history of Florence" book buyers?

KXEN Time Series (KTS) Exercises

A large discount retailer needs to know the sales forecasts for the coming 12 months in order to anticipate its stock supply. The data cover monthly sales amounts from 1997 to 2004.
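Stepping back to the clustering exercises above: what "build a clustering of 5 clusters" asks K2S to do can be illustrated with a tiny hand-rolled k-means. This is not KXEN's actual algorithm, and the data and the choice of k = 2 are invented for readability:

```python
# A toy illustration (not KXEN's K2S algorithm) of clustering: k-means
# repeatedly assigns each record to its nearest centre and then moves each
# centre to the mean of the records assigned to it.
def kmeans(points, centres, iterations=10):
    """1-D k-means; returns the final centres, sorted."""
    for _ in range(iterations):
        groups = {c: [] for c in centres}
        for p in points:
            nearest = min(centres, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        # Empty groups keep their old centre.
        centres = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centres)

# Two obvious groups of customer "spend" values; k = 2 for readability.
spend = [10.0, 12.0, 11.0, 90.0, 95.0, 93.0]
centres = kmeans(spend, centres=[0.0, 50.0])
# The centres converge to the two group means, 11.0 and about 92.67.
```

The KXEN indicators compared in the exercise (for 4, 5, 6 or 7 clusters) measure how well-separated and how reliable such groups are on held-out data.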
The retailer will use KXEN Time Series to identify the trend and the cycles, and to perform forecasts. The file RetailMonthlySales_97-04.csv contains the monthly sales amounts from 1997 to 2004.

Step 1 : Forecasts Equal to 6

Run KTS with the default values and provide the forecasts for the next 6 periods. Present and explain the components of the final model (Trend, Cycle, Fluctuation) and discuss its accuracy. Generate the forecasts for the next 6 months and compute the confidence interval for each predicted value.

Step 2 : Forecasts Equal to 12

Re-run KTS with a number of forecasts equal to 12. What are the new model and its accuracy? Generate the forecasts for the next 12 months and compute the confidence interval for each predicted value. What is the value of the Maximum Horizon?

Step 3 : Compare K2S and K2R Results

Discuss the differences between a segmentation and a classification. In which cases would you choose one or the other?

KXEN Association Rules (KAR) Exercises

An e-business Web site would like to understand the behaviors that lead a client to the purchasing act. The goal here is to figure out which transitions between web pages are most likely to lead people to purchase.

The data sets to be used for this scenario are the following:
■ References: website_references.csv. This is a comma-separated file containing the identifiers of each session. Each of these keys, named SessionID, represents one visit to the website.
■ Transactions: website_transactions.csv. This file reports each page accessed. Each transaction is defined by a session Id, the Web page accessed, the IP address and the date/time of the session.

KXEN Event Log (KEL) Exercises

A bank is launching a marketing campaign to promote a new product. The bank decided to mail its customers at different dates (due to its mailing capacity).
Step 1 : Defining a Relevant KAR Data Set

Run KAR with the default threshold values and without the Sequence mode. Are all the rules found useful and relevant?

Run KAR once again but, in order to limit the number of rules generated, check the Sequence Mode box and let KXEN build all possible rules with a minimum support of 50. Then limit these rules to those whose KI is greater than 0.5. Tune the threshold values to detect rules implying the "PURCHASEOK" item.

Step 2 : Select Relevant Rules

Re-run KAR with a minimum support of 5, checking the Sequence mode box, and display all rules whose antecedent is "/shop/bodyKidsstuff.tmpl". Can you identify a rule implying PURCHASEOK?

KXEN Sequence Coder (KSC) Exercises

An e-business company decided to rethink its web site so as to improve its overall profits. To do that, the company needs to better understand which behaviors lead to the purchasing act and which sequences of web pages are most likely to influence a purchase.

KXEN Event Log (KEL) Exercises

On top of the mailing information, the bank decided to use the "credit card usage" information: it enriches its analytic dataset with the customers' credit card transactions made before the sending of the mailing. To build the underlying model, the bank has at its disposal:
■ A file displaying "static" information on each customer involved in the "Starter" campaign (customer.txt), where the variables representing the purchase of the product are called mail_answer and mail_amount.
■ A "transactions" file, named CardUsage.txt, listing all the transactions made.

Step 1 : Building an Analytic Dataset and Running a Model

Use KEL to join both tables on CustomerID, building 4 quarterly aggregates up to the dates of mailing. Define and test as many new KEL variables as you want in order to build a robust and good segmentation model. The results of the campaign are stored in the variables mail_answer and mail_amount.
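The quarterly aggregates that KEL builds from the transactions can be sketched by hand: for each customer, transactions are bucketed into the 3-month windows that precede that customer's mailing date. This is a minimal illustration, not KEL's implementation; the dates and the 91-day quarter length are assumptions:

```python
# A minimal sketch of KEL-style quarterly aggregates: count each customer's
# transactions in the four ~3-month windows ending at the mailing date.
from datetime import date, timedelta

def quarterly_counts(transactions, mailing_date, quarters=4, days=91):
    """Count transactions per ~quarter window ending at mailing_date.
    Index 0 is the most recent quarter before the mailing."""
    counts = [0] * quarters
    for tx_date in transactions:
        age = (mailing_date - tx_date).days
        if 0 <= age < quarters * days:
            counts[age // days] += 1
    return counts

# Invented card-usage dates for one customer, relative to their mailing date.
mailing = date(2008, 1, 15)
card_usage = [mailing - timedelta(days=d) for d in (10, 30, 100, 200, 400)]
counts = quarterly_counts(card_usage, mailing)
# Two transactions fall in the last quarter, one in each of the two quarters
# before that, none in the fourth; the 400-day-old one is outside the window.
```

Because each customer was mailed on a different date, the windows are anchored per customer, which is exactly why the aggregates must stop "at the dates of mailing".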
Exclude the continuous target mail_amount from the modeling process.

Step 2 : Analyzing the Results

Analyze the robustness, the variables' contributions and the profit curve.

Step 3 : Impact of KEL

Rebuild the classification model without involving the transactional data, then compare the results. What do you conclude?

KXEN Sequence Coder (KSC) Exercises

The company is using KXEN Sequence Coder in combination with KXEN Robust Regression to detect such sequences. The data sets are:
■ A data file, named web_session, containing one day's worth of session logs, where the variable representing the purchase act is Purchase.
■ A data file detailing the customers' sessions, web_category_view.

Step 1 : Run KSC and K2R in Combination

The join column is Session_Id for both data sets. Calculate the intermediate sequences and keep only 40% of the information. The statistics to calculate for the category_Id variable are Count and CountTransition. Run KSC followed by K2R, using PURCHASE as the target variable.

Step 2 : Analyzing the Results

Analyze the results carefully: anything suspicious? Solve the issue and rebuild the model.

Step 3 : Impact of KSC

Rebuild the classification model without involving the transactional data, then compare the results. What do you conclude?

KXEN Data Manipulation Exercises

The goal of this exercise is to let you build an analytical data set step by step. A few prerequisites have to be met:
■ First ensure that an ODBC connection to the "DataManip.mdb" MS Access database has been set up.
■ If your computer does not have an MS Access application, import the 4 text files as tables into another ODBC-compliant database.

Step 1 : Select Fields to Build Your Analytical Dataset

Select the individual_info table as the reference table to start building your data set. Link all the tables with individual_info and select all fields. Uncheck the external keys.
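What "link all tables with individual_info" amounts to can be sketched as a left join on the reference table's key. This is a toy illustration; apart from individual_info, the field names and rows below are invented:

```python
# A toy sketch of Step 1: attach fields from another table to each row of
# the reference table by matching on a key (a left join).
def left_join(reference, other, key):
    """Attach other's fields to each reference row matching on `key`."""
    index = {row[key]: row for row in other}
    joined = []
    for row in reference:
        match = index.get(row[key], {})
        merged = dict(row)
        # Unchecking the external keys corresponds to not duplicating
        # the join column in the output.
        merged.update({k: v for k, v in match.items() if k != key})
        joined.append(merged)
    return joined

# individual_info is the reference table; incomes is an invented linked table.
individual_info = [{"id": 1, "age": 42}, {"id": 2, "age": 35}]
incomes = [{"id": 1, "income": 61000}]
dataset = left_join(individual_info, incomes, "id")
# Row 1 gains an income field; row 2 has no match and keeps only its own fields.
```

A left join keeps every individual even when a linked table has no matching row, which is what you want when the reference table defines the population to score.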
Step 2 : Create New Variables and Filters

Now create your target variable. The target variable will be called "Class" and will refer to the individual's yearly income: it takes the value 1 if the individual earns more than 50,000$ a year, and 0 otherwise.

Create a new filter on the variable AGE, limiting your dataset to individuals older than X years, where the value of X is user-defined.

Step 3 : Save and Use Your Data Manipulation

Save your data manipulation. Build a K2R classification on the CLASS target variable of this data manipulation.

Step 4 : Enrich Your Data Manipulation with Behavioral Data

Define a new data manipulation based on your first one. Choose the "AGGREGATES" tab; this tab allows you to add aggregates similar to those automatically created by the KEL component. Build a new aggregate based on the COUNT of the number of calls (count on DAT; the transactions are stored in the calls table). Save your changes into another data manipulation and repeat Step 3.
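The Class target, the AGE filter and the COUNT aggregate from Steps 2 and 4 can be sketched as follows. The field and table names follow the exercise; the sample rows are invented:

```python
# A minimal sketch of Steps 2 and 4 of the data manipulation exercise.
def add_class(rows, threshold=50_000):
    """Class = 1 if the individual earns more than 50,000$ a year, else 0."""
    for row in rows:
        row["Class"] = 1 if row["income"] > threshold else 0
    return rows

def filter_age(rows, min_age):
    """Keep individuals strictly older than the user-defined X."""
    return [row for row in rows if row["age"] > min_age]

def count_calls(calls, individual_id):
    """COUNT aggregate over the calls table for one individual (count on DAT)."""
    return sum(1 for call in calls if call["id"] == individual_id)

# Invented sample rows; only the field names come from the exercise.
people = [{"id": 1, "age": 42, "income": 61_000},
          {"id": 2, "age": 25, "income": 30_000}]
calls = [{"id": 1, "DAT": "2008-01-03"}, {"id": 1, "DAT": "2008-01-07"}]

labelled = add_class(people)
adults_40 = filter_age(labelled, 40)   # only individual 1 remains
n_calls = count_calls(calls, 1)        # individual 1 made 2 calls
```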