Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
KDD Cup Survey Xinyue Liu 1 Outline Nuts and Bolts of KDD Cup KDD Cup 97-99 KDD Cup 2000 Summary 2 About KDD Cup A knowledge discovery and data mining tools competition in conjunction with KDD conferences. It aims at: showcase the best methods for discovering higherlevel knowledge from data. Helping to close the gap between research and industry Stimulating further KDD research and development 3 Statistics Participation in KDD Cup grew steadily, especially requests to access the data Average person-hours per submission: 204 Max person-hours per submission: 910 Commercial software grew from 44% (cup 97) to 52% (cup 98) to 77% (cup 2000) 4 Access and Final Participation 200 150 Count Cup 97 Cup 98 100 Cup 99 50 Cup 2000 0 NDA (access to data) Participants Algorithms Decision trees most widely tried and by far the most commonly submitted 5 KDD Cup 97 A classification task – to predict Financial services industry direct mail response Winners Charles Elkan, a PhD from UC-San Diego with his Boosted Naive Bayesian (BNB) Silicon Graphics, Inc with their software MineSet Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System 6 BNB Boosting – to learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor. Repeated T rounds. BNB – representationally equivalent to a multilayer perceptron with a single hidden layer. Complexity – O(ef) e – examples f - attributes 7 MineSet A KDD tool that combines data access, transformation, classification, and visualization. 8 KDD Cup 98 URL: www.kdnuggets.com/meetings/kdd98/kdd-cup- 98.html A classification task – to analyze fund raising mail responses to a non-profit organization Winners Urban Science Applications, Inc. with their software GainSmarts. SAS Institute, Inc. with their software Enterprise Miner. Quadstone Limited with their software Decisionhouse 9 GainSmarts GainSmarts – a feature selection expert system First step - used Logistic Regression to assign each prospect a probability of donation (Pi). Second step - used Linear Regression to estimate a conditional donation amount of responding donors (Ai) Result (<1% error) Prediction = Pi * Ai 10 Enterprise Miner A data mining solution that addresses the entire data mining process SEMMA Process Sample Explore Modify Model Assess Algorithms Decision tree Neural network Regression 11 Decisionhouse Decisionhouse – an integrated modelling software suite by Quadstone Data exploration using visualization modules. Use Decision trees and Scorecards to model more complex tasks. Choose the final model by comparing a variety of modeling approaches and looking at the difference in predicted net profitability (lift curve). 12 Results $70,000 $65,000 $60,000 $55,000 $50,000 $45,000 $40,000 $35,000 $30,000 $25,000 $20,000 $15,000 $10,000 $5,000 $- Maximum Possible Profit Line ($72,776 in profits with 4,873 mailed) 100% 90% 80% Mail to Everyone Solution ($10,560 in profits with 96,367 mailed) 70% 60% 50% GainSmarts SAS/Enterprise Miner Quadstone/Decisionhouse 40% 30% 20% 10% 0% KDD Cup 99 URL: www.cse.ucsd.edu/users/elkan/kdresults.html Problem same data set as KDD Cup 98 Winners SAS Institute Inc. with their software Enterprise Miner. Amdocs with their Information Analysis Environment 14 Software SAS – using two-stage model which includes two multi-layer perceptron (MLP) neural networks models. Amdocs – using its own Information Analysis Environment, which allows modeling of the value and class membership simultaneously. Algorithms used is a hybrid logistic regression model 15 KDD Cup 2000 www.ecn.purdue.edu/KDDCUP/ Sponsored by Purdue University Blue Martini Software 16 Data Set Data collected from Gazelle.com, a legwear and legcare web retailer Pre-processed Training set: 2 months Test sets: one month Data collected includes: 17 Click streams Order information Registration form Problems The goal – to design models to support web-site personalization and to improve the profitability of the site by increasing customer response. Questions - When given a set of page views, will the visitor view another page on the site or leave? which product brand will the visitor view in the remainder of the session? characterize heavy spenders characterize killer pages characterize which product brand a visitor will view in the remainder of the session? 1. 2. 3. 4. 5. 18 Evaluation Accuracy/score was measured for the two questions with test sets Insight questions judged with help of retail experts from Gazelle and Blue Martini Created a list of insights from all participants Each insight was given a weigh Each participant was scored on all insights Additional factors: Presentation quality Correctness 19 The Winners Question 1 & 5 Winner: Amdocs Question 2 & 3 Winner: Salford Systems Question 4 Winner: e-steam poster 20 Software (Amdocs) Exploratory Data Analysis – SAS Classification Tree – Amdocs Business Insight Tool Decision tree Rules Extraction Modeling Combining models 21 Scheme Yes Bot / Crawler ? No one-click Final Prediction Model Score Model 0 for all Score Model 5 rules (0 or 1) Main Model All Data Yes Score Model 1 for all multi-click Partial Testing Pattern? 22 Hybrid Merged No Score Model Ensemble Best Rule Main Model Decision Tree 5 trees built on 34000 cases Rule Generator 1466 rules 111 continue rules Best Rule Rule 23 Hybrid Hybrid Model Model Merged Merged Rules Rules Sub-models Best rule Hybrid Model Merged Rules 24 Chooses most accurate rule satisfied by each record Logistic regression on rule set + raw field values combine to define score for each record Logistic regression on rule set defines score for each record as a combination of rules the record satisfies Each model captures a different aspect of the overall behavior in the data. Combining or ensembling the models provides the best prediction results. Software (Salford) CART - a decision tree tool that automatically searching for and isolating significant patterns and relationships MARS - a multivariate non-parametric regression procedure HotSpotDetector TreeNet 25 Cart Binary recursive partitioning. Key elements: Splitting rules Brute force search all possible splits for all variables Rank each splitting rule on the basis of a quality-of-split criterion (default GINI) Recursion - split until further splitting is impossible or stopped. Class assignment Plurality rule Assign every node whether it is terminal or not. Pruning Trees – does not stop in the middle Testing - best sub-tree is the one with the lowest error 26 MARs Automatic variable search Automatic variable transformation Automatic limited interaction searches Variable nesting Built-in testing regimens model selection parameters. 27 Insights (Heavy Spenders) Some of the Good insights Referrers - establish ad policy based on conversion rates, not click-throughs Not an AOL user - browser window too small for layout Referring site traffic changed dramatically over time Came to site from print-ad or news, not friends & families Very high and very low income Geographic: Northeast U.S. states Repeat visitors 28 Insights (Who leaves?) Some of the good insights Crawlers, bots accounted for 16% of sessions Long processing time (> 12 seconds) implies high abandonment Referring sites: mycoupons have long sessions, shopnow.com are prone to exit quickly Returning visitors' prob of continuing is double View of specific products (Oroblue,Levante) cause abandonment Probability of leaving decreases with page views Free Gift and Welcome templates on first three pages encouraged visitors to stay at site 29 Insights(Brand view) Some 30 good insights Referrer URL is great predictor: Fashionmall.com and winnie-cooper are referrers for Hanes and Donna Karan mycoupons.com, tripod, deal-finder are referrers for American Essentials Previous views of a product imply later views Summary Data mining requires background knowledge and access to business users Successful data mining solutions combine automated and manual analysis, integrating the power of the machine with expert knowledge and human insight Web Mining is challenging: crawlers/bots, frequent site changes, etc. KDD Cup is an excellent source to learn the stateof-art KDD techniques KDD Cup data available for research and education 31 References Elkan C. (1997). Boosting and Naive Bayesian Learning. Technical Report No. CS97-557, September 1997, UCSD. Decisionhouse (1998). KDD Cup 98: Quadstone Take Bronze Miner Award. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/quadstone/index.ht ml Urbane Science (1998). Urbane Science wins the KDD-98 Cup. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/gain-kddcup98release.html Georges, J. & Milley, A. (1999). KDD’99 Competition: Knowledge Discovery Contest. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/saskdd99.pdf Rosset, S. & Inger A. (1999). KDD-Cup 99 : Knowledge Discovery In a Charitable Organization’s Donor Database. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/KDD2.doc 32 References (Cont.) Sebastiani P., Ramoni M. & Crea A. (1999). Profiling your Customers using Bayesian Networks. Retrieved March 15, 2001 from http://bayesware.com/resources/tutorials/kddcup99/kddcup99.pdf Inger A., Vatnik N., Rosset S. & Neumann E. (2000). KDD-Cup 2000: Question 1 Winner’s Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.ppt Neumann E., Vatnik N., Rosset S., Duenias M., Sasson I. & Inger A. (2000). KDD-Cup 2000: Question 5 Winner’s Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-5.ppt Salford System white papers: http://www.salford-systems.com/whitepaper.html Summary talk presented at KDD (2000) http://robotics.stanford.edu/~ronnyk/kddCupTalk.ppt 33