Download Data Mining: Emerging Trends, Challenges and Applications

DATA SCIENCE APPLICATIONS
@DATAMININGAPPS

Prof. dr. Bart Baesens
Dr. Seppe vanden Broucke
Department of Decision Sciences and Information Management, KU Leuven (Belgium)
School of Management, University of Southampton (United Kingdom)
{Bart.Baesens;Seppe.vandenBroucke}@kuleuven.be
Twitter/Facebook/Youtube: DataMiningApps
www.dataminingapps.com

The Analytics Process Model
1. Identify Business Problem
2. Identify Data Sources
3. Select the Data
4. Clean the Data
5. Transform the Data
6. Analyze the Data
7. Interpret, Evaluate, and Deploy the Model
Phases: Preprocessing, Analytics, Postprocessing
Baesens (2014), Analytics in a big data world: The essential guide to data science and its applications

Team members
• Database/Datawarehouse administrator
• Business expert (e.g. marketeer, credit risk analyst, …)
• Legal expert
• Data scientist/data miner
• Software/tool vendors
• A multidisciplinary team needs to be set up!

Data Scientist
• A data scientist should be a good programmer!
• A data scientist should have solid quantitative skills!
• A data scientist should excel in communication and visualization skills!
• A data scientist should have a solid business understanding!
• A data scientist should be creative!
Baesens, Weber, Bravo, vanden Broucke (2015), Hiring Data Scientists: what to look for!

Analytics
• Term often used interchangeably with data mining, knowledge discovery, predictive/descriptive modeling, …
• Essentially refers to extracting useful business patterns and/or mathematical decision models from a preprocessed data set
• Predictive analytics
  – Predict the future based on patterns learnt from past data
  – Classification (churn, response) versus regression (CLV)
• Descriptive analytics
  – Describe patterns in data
  – Clustering, Association rules, Sequence rules

Analytic Model requirements
• Business relevance
  – Solve a particular business problem
• Statistical performance
  – Statistical significance of model
  – Statistical prediction performance
• Interpretability + Justifiability
  – Very subjective (depends on decision maker), but CRUCIAL!
  – Often need to be balanced against statistical performance
• Operational efficiency
  – How can the analytical models be integrated with campaign management?
• Economical cost
  – What is the cost to gather the model inputs and evaluate the model?
  – Is it worthwhile buying external data and/or models?
• Regulatory compliance
  – In accordance with regulation and legislation
Baesens et al (2003), Using neural network rule extraction and decision tables for credit-risk evaluation
Verbraken, Verbeke, Baesens (2013), A novel profit maximizing metric for measuring classification performance of customer churn prediction models

Post processing
• Interpretation and validation of analytical models by business experts
  – Trivial versus unexpected (interesting?) patterns
• Sensitivity analysis
  – How sensitive is the model with respect to sample characteristics, assumptions, etc.?
• Deploy analytical model into business setting
  – Represent model output in a user-friendly way
  – Integrate with campaign management tools and marketing decision engines
• Model monitoring and backtesting
  – Continuously monitor model output
  – Contrast model output with observed numbers
Castermans, Martens, Van Gestel, Hamers, Baesens (2010), An overview and framework for PD backtesting and benchmarking

Applications
• Credit scoring
• Market basket analysis/recommender systems
• Retention Modeling/churn prediction
• Response modeling
• On-line Analytics
• Social Media Analytics
• Social Network Analytics
• Fraud Analytics
• HR Analytics
• Process Analytics

Credit Scoring
• Estimate probability of default at the time the applicant applies for the loan!
• Use predetermined definition of default (e.g. 3 months of payment arrears)
• Use application variables
  – E.g. age, income, marital status, years at address, years with employer, …
• Use bureau variables
  – Bureau score, raw bureau data (e.g. number of credit checks, total amount of credits, delinquency history, …)
  – In the US: FICO scores between 300 and 850
  – Experian, Equifax, TransUnion
  – E.g., Baycorp Advantage (Australia & New Zealand), Schufa (Germany), BKR (the Netherlands), CKP (Belgium), Dun & Bradstreet
Van Gestel, Baesens (2009), Credit Risk Management: Basic Concepts.

Example Credit Scorecard

Characteristic Name   Attribute    Scorecard Points
AGE 1                 Up to 26     100
AGE 2                 26 - 35      120
AGE 3                 35 - 37      185
AGE 4                 37+          225
GENDER 1              Male         90
GENDER 2              Female       180
SALARY 1              Up to 500    120
SALARY 2              501-1000     140
SALARY 3              1001-1500    160
SALARY 4              1501-2000    200
SALARY 5              2001+        240

Let cut-off = 500

So, a new customer applies for credit:

AGE       32        120 points
GENDER    Female    180 points
SALARY    $1,150    160 points
Total               460 points

The total of 460 points is below the cut-off of 500: REFUSE CREDIT
Baesens et al (2003), Benchmarking state-of-the-art classification algorithms for credit scoring

Association rules
• Purpose
  – Detect frequently occurring patterns between items
• How?
  – Unsupervised data mining (no real target to optimise)
  – Deriving association rules
• Example Applications
  – Which products/services are frequently bought together?
  – Which web pages are frequently visited together?
  – Which terms often co-occur in a text document?

Association rules
• Notation:
  – $D$: database of transactions $t_p$
  – each transaction $t_p$ consists of a transaction ID and a set of items $\{i_1, i_2, \ldots, i_n\}$ selected from all possible items $I$
• An association rule is an implication of the form $X \rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$
  – $X$: rule antecedent, $Y$: rule consequent
• Example:
  – If a customer has a car loan and car insurance, then the customer has a checking account in 80% of the cases
  – If a customer buys spaghetti, then the customer buys red wine in 70% of the cases

Example transactions database

Transaction   Items
1             stella, hoegaarden, diapers, baby food
2             coke, stella, diapers
3             cigarettes, diapers, baby food
4             chocolates, diapers, hoegaarden, apples
5             tomatoes, water, leffe, stella
6             spaghetti, diapers, baby food, stella
7             water, stella, baby food
8             diapers, baby food, spaghetti
9             baby food, stella, diapers, hoegaarden
10            apples, chimay, baby food

Association Rules: Support and Confidence
• Support of an itemset is the percentage of total transactions in the database that contains the itemset.
• The rule $X \rightarrow Y$ has support $s$ if $100s\%$ of the transactions in $D$ contain $X \cup Y$.

  $$\text{support}(X \rightarrow Y) = \frac{\text{number of transactions supporting } X \cup Y}{\text{total number of transactions}}$$

• A frequent itemset is an itemset for which the support is higher than a prespecified threshold (minsup).
• The rule $X \rightarrow Y$ has confidence $c$ if $100c\%$ of the transactions in $D$ that contain $X$ also contain $Y$.

  $$\text{confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{\text{support}(X \cup Y)}{\text{support}(X)}$$

Associations: Support and Confidence
Using the example transactions database above (transactions 1 to 10):
• E.g. itemset {baby food, diapers, stella} has support = 3/10 or 30%
• Association rule: baby food, diapers → stella has confidence of 3/5 or 60%

Association rule discovery
• Often lots of association rules will be discovered
• Post-processing is a necessity
  – Perform sensitivity analysis using minsup and minconf thresholds
  – Trivial rules, e.g., buy spaghetti and spaghetti sauce
  – Unexpected/Unknown rules
  – Novel and actionable patterns, potentially interesting!
• Appropriate visualisation facilities are crucial!
Baesens et al. (2000), Post-processing of association rules

Market Basket Analysis
baby food, diapers → stella
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package baby food, diapers and stella.
4. Package baby food, diapers and stella + poorly selling item.
5. Raise the price on one, and lower it on the other.
6. Do not advertise baby food, diapers and stella together.

Recommender Systems
• Help people make decisions by giving them recommendations.
• Recommendations are based on preferences of individuals/groups.
• Examples
  – In e-Business, recommend items.
  – In e-Learning, recommend content.
  – In search and navigation, recommend links.
• Netflix competition
  – Predict whether someone will enjoy a movie based on how much they liked or disliked other movies.
• Amazon, eBay, …
Seret, Verbraken, Versailles, Baesens (2012), A new SOM-based method for profile generation: Theory and an application in direct marketing

Example: Recommender Systems

Retention modeling/Churn prediction
• Understanding why customers leave you
• Customer retention is important because long-term loyal customers are less price sensitive, cost less to serve and have a higher lifetime value
• Small improvements in customer retention generate significant returns.
• Very important in Telco sector (about 2% monthly churn rate)
• Transaction versus Relationship buyers
  – Transaction buyers: buy because of low price
  – Relationship buyers: want to build loyal relationship with firm
Glady, Baesens, Croux (2009), Modeling churn using customer lifetime value

Defining churn
• Contractual versus Non-contractual setting
  – Contractual setting: customer cancels contract (e.g. postpaid Telco)
  – Non-contractual setting: customer hasn’t purchased any products or services during previous 3 months (e.g. online retailer)
• Types of churn
  – Active: customer stops relationship with firm
  – Passive: customer decreases intensity of relationship
  – Forced: company stops relationship because of e.g. fraud
  – Expected: customer no longer needs product/service (e.g. baby products)
Baesens et al. (2002), Bayesian neural network learning for repeat purchase modelling in direct marketing

Churn prediction: types of predictors
• Demographic data
  – E.g., age, gender, marital status
• Relationship variables
  – E.g. length of relationship, number of products purchased, …
• Product/Service usage data
  – E.g., number of transactions in previous month, trend in usage, …
• Complaints data
  – Number of filed complaints, Service desk contacted, …
• RFM data
• (Social) network information (cf. infra)

RFM Framework
• Already popular since (Cullinan, 1977)
• Recency: Number of months since last purchase
• Frequency: Number of purchases within a given time frame
• Monetary: dollar value of purchases
• Different operationalisations of RFM variables
  – E.g., Monetary: average/maximum/total dollar value?
  – Trend variables
• Can only be measured for existing customers, not for prospects (e.g. response modeling)
• Often used to build a segmentation scheme or combine into a single RFM score

Response modeling
• Customer acquisition: acquiring new customers with targeted campaigns, win-back campaigns
• Campaign can be mail catalogue, email, coupon, A/B or multivariate testing, …
• Identify the customers most likely to respond based on the following information:
  – Demographic variables (age, gender, marital status, …)
  – Relationship variables (length of relationship, number of products purchased, …)
  – RFM variables
  – (Social) network information (cf. infra)

Response modeling setup
• Split target group into test group and control group
• Test group receives marketing material and control group does not
• Incremental impact equals the additional purchases that are directly attributable to the campaign (Larsen, 2010)
• Incremental impact = test group purchase rate - control group purchase rate
Baesens (2014), Analytics in a big data world: The essential guide to data science and its applications

Measuring incremental impact (Larsen, 2010)
• Try to factor in the behavior of self-selecting clients, clients that purchase regardless of the marketing campaign
• Focus should be on swing clients: interested in the product, but need to be motivated (by e.g. marketing message) to take action
• Both test and control group should be representative
• Find a model such that the difference between the test group purchase rate and the control group purchase rate is maximized (i.e. identifying the swing clients)

Gross versus Net Lift Models (Lo, 2002)
[Figure: two panels, Gross Lift and Net Lift. In each panel, previous campaign data (control and test groups) is split into training data and holdout data and feeds a model.]

Net Lift models (Larsen, 2010)
[Figure: the test group consists of self-selectors and converted swing clients (Y=1) and clients with no purchase (Y=0); the control group consists of self-selectors (Y=1) and swing clients and clients with no purchase (Y=0).]

Building a Difference Score Model
• Step 1: Build a logistic regression model estimating probability of purchase given treatment, P(purchase|test)
• Step 2: Build a logistic regression model estimating probability of purchase given control, P(purchase|control)
• Step 3: Incremental score = P(purchase|test) - P(purchase|control)
Note: to understand the impact of the predictors, regress the incremental lift scores on the original data!

On-line analytics: example questions
• How do customers find my website (Google, Facebook, …)?
• How to optimise my on-line marketing mix (e.g. Google SEO versus Google Adwords)?
• Where am I sending customers to?
• What is the average time customers spend on my website?
• How can I customise the on-line experience?
• How to measure customer engagement?
• …

On-line analytics: data collection
• Web server logs (server side)
  195.162.218.155 - - [27/Jun/2002:00:01:54 +0200] "GET /dutch/shop/detail.html HTTP/1.1" 200 38890 "http://www.msn.be/shopping/food/" "Mozilla/4.0 (MSIE 6.0)"
• Page tagging (client side)
  – “tagging” web page with a code snippet referencing a separate JavaScript file
• Cookies
  – small text string that a Web server can send to a visitor's Web browser (as part of its HTTP response)
  – privacy! (cf. regulatory compliance)

KPI monitoring using dashboards

On-line Analytics: challenges
• Extremely messy data
  – Extensive preprocessing needed
  – Focus on trends + segmentation
• Information overload: too many metrics!
• Focus on actionable metrics
  – Bounce rate: share of visits in which the visitor left instantly
  – Conversion rate: percentage of visitors for whom the target event was observed (e.g. purchase, pdf download, registration, …)
• Integrate on-line with off-line customer data!
Huysmans, Mues, Vanthienen, Baesens (2004), Web Usage Mining with Time Constrained Association Rules.

Social Media Analytics
• Analyse on-line social media data (e.g. Twitter feeds, Facebook messages, …)
• Applications
  – Corporate reputation and sentiment analysis
  – Identification of key themes, opinions and trending topics
  – Social Graphing and Viral Tracking
• Develop a social CRM strategy!

Social Network Analytics
• Networked data
  – Telephone calls
  – Facebook, Twitter, LinkedIn, …
  – Web pages connected by hyperlinks
  – Research papers connected by citations
  – Terrorism networks
• Applications
  – Product recommendations
  – Churn detection
  – Web page classification
  – Fraud detection
  – Terrorism detection?
Baesens (2014), Analytics in a big data world: The essential guide to data science and its applications

Example: Social Networks in a Telco context
• Traditional churn prediction models treat customers as isolated entities
• However, customers are strongly influenced by their social environment:
  – recommendations from peers, word-of-mouth publicity
  – social leader influence
  – promotional offers from operators to acquire groups of friends
  – reduced tariffs for intra-operator traffic
→ Take into account the customers’ social network!
Verbeke, Martens, Baesens (2014), Social network analysis for customer churn prediction

Fraud Analytics
• Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms.
Baesens, Van Vlasselaer, Verbeke (2015), Fraud Analytics using Descriptive, Predictive and Social Network Techniques.

Fraud Analytics
Credit card transaction fraud:
• Stolen credit cards (yellow nodes) are often used in the same stores (blue nodes)
• The store itself also processes legitimate transactions to cover its fraudulent activities

Fraud Analytics
Identity theft:
• Before: person calls his/her frequent contacts
• After: person also calls new contacts which coincidentally overlap with another person's contacts.
[Figure: call network before and after]

Fraud Analytics
Social security fraud:
• Companies are frequently associated with other companies that perpetrate suspicious/fraudulent activities.
Van Vlasselaer, Eliassi-Rad, Akoglu, Snoeck, Baesens (2016), Gotcha! Network-based fraud detection for social security fraud

HR analytics
• Employee churn
• Employee performance
• Employee absence
• Employee satisfaction
• Employee Lifetime Value
• …

Example Absenteeism scorecard

Characteristic Name   Attribute     Points
Age                   Up to 26      100
Age                   26-35         120
Age                   35-37         185
Age                   37+           225
Function              No-manager    90
Function              Manager       180
Department            HR            120
Department            Marketing     140
Department            Finance       160
Department            Production    200
Department            IT            240

Let cutoff = 500

So, a new employee needs to be scored:

Age           32         120 points
Function      Manager    180 points
Department    Finance    160 points
Total                    460 points

No Absenteeism!
Baesens (2014), 5 Reasons to Start with Predictive Employee Turnover Analytics.
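
The absenteeism scorecard above can be sketched as a simple lookup, shown here only to make the arithmetic of the worked example explicit. The point values and the cutoff come from the slide. The handling of band boundaries (upper bounds treated as inclusive) and the label returned for totals at or above the cutoff are assumptions, since the slide only shows a case below the cutoff.

```python
# Points per attribute, copied from the absenteeism scorecard slide.
# Assumption: each age band includes its upper bound (e.g. age 26 falls in "Up to 26").
AGE_BANDS = [(26, 100), (35, 120), (37, 185)]  # (upper bound, points)
AGE_POINTS_ABOVE_LAST_BAND = 225               # the "37+" band
FUNCTION_POINTS = {"No-manager": 90, "Manager": 180}
DEPARTMENT_POINTS = {
    "HR": 120,
    "Marketing": 140,
    "Finance": 160,
    "Production": 200,
    "IT": 240,
}
CUTOFF = 500


def age_points(age):
    """Return the points of the first age band whose upper bound covers the age."""
    for upper_bound, points in AGE_BANDS:
        if age <= upper_bound:
            return points
    return AGE_POINTS_ABOVE_LAST_BAND


def score_employee(age, function, department):
    """Sum the points of the three characteristics."""
    return age_points(age) + FUNCTION_POINTS[function] + DEPARTMENT_POINTS[department]


def decide(total, cutoff=CUTOFF):
    """Totals below the cutoff read "No Absenteeism", as on the slide.

    The label for totals at or above the cutoff is hypothetical.
    """
    return "No Absenteeism" if total < cutoff else "Absenteeism risk"


total = score_employee(32, "Manager", "Finance")
print(total, decide(total))  # 120 + 180 + 160 = 460, below the cutoff of 500
```

The same lookup structure would apply to the credit scorecard earlier in the deck, with its own characteristics and decision labels.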

Hiring & Firing
Baesens, De Winne, Sels, What to Do Before You Fire a Pivotal Employee, 2016

Process Analytics
• Extracting knowledge from event logs of information systems
  – Control flow perspective
  – Organizational perspective
  – Information perspective
De Weerdt, De Backer, Vanthienen, Baesens (2012), A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs

Process Analytics: Example

Case ID   Activity Name          Originator   Timestamp             Extra Data
001       Make order form        Mary         20-07-2010 14:02:06   …
002       Make order form        Jane         20-07-2010 15:45:29   …
001       Scan invoice           John         10-08-2010 09:52:31   …
001       Central registration   John         10-08-2010 10:00:36   …
002       Scan invoice           John         11-08-2010 09:15:22   …
002       Central registration   John         11-08-2010 09:20:01   …
001       Accepted               Sophie       13-08-2010 08:20:54   …
002       Accepted               Sophie       13-08-2010 08:21:12   …
001       Decentral rejection    Mary         14-08-2010 14:15:14   …
001       End                    System       14-08-2010 14:15:15   …
002       Decentral approval     Jane         16-08-2010 19:22:56   …
002       Invoice booked         System       16-08-2010 19:22:59   …
002       End                    System       16-08-2010 19:23:00   …
003       Make order form        Mary         19-08-2010 07:52:41   …
004       Make order form        Mary         19-08-2010 15:21:39   …

[Figure: process model discovered from the event log, with activities Make order form, Scan invoice, Central registration, Rejected, Decentral revision, Accepted, Decentral approval, Invoice booked, Decentral rejection and End. Discovery algorithms named: Genetic Miner, HeuristicsMiner, AGNEs Miner.]
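
As a rough illustration of the control flow perspective, the event log above can be grouped into one trace per case and summarised by counting which activity directly follows which. This is a minimal sketch and not the Genetic Miner, HeuristicsMiner or AGNEs Miner named in the figure; directly-follows counting is a much simpler stand-in. The function names `build_traces` and `directly_follows` are hypothetical, and the incomplete cases 003 and 004 are left out for brevity.

```python
from collections import Counter, defaultdict
from datetime import datetime

# (case ID, activity, timestamp) rows for cases 001 and 002, copied from the slide.
EVENTS = [
    ("001", "Make order form", "20-07-2010 14:02:06"),
    ("002", "Make order form", "20-07-2010 15:45:29"),
    ("001", "Scan invoice", "10-08-2010 09:52:31"),
    ("001", "Central registration", "10-08-2010 10:00:36"),
    ("002", "Scan invoice", "11-08-2010 09:15:22"),
    ("002", "Central registration", "11-08-2010 09:20:01"),
    ("001", "Accepted", "13-08-2010 08:20:54"),
    ("002", "Accepted", "13-08-2010 08:21:12"),
    ("001", "Decentral rejection", "14-08-2010 14:15:14"),
    ("001", "End", "14-08-2010 14:15:15"),
    ("002", "Decentral approval", "16-08-2010 19:22:56"),
    ("002", "Invoice booked", "16-08-2010 19:22:59"),
    ("002", "End", "16-08-2010 19:23:00"),
]

TIMESTAMP_FORMAT = "%d-%m-%Y %H:%M:%S"  # day-month-year, as in the slide


def build_traces(events):
    """Group events by case ID and order each case's activities by timestamp."""
    rows_by_case = defaultdict(list)
    for case_id, activity, stamp in events:
        rows_by_case[case_id].append((datetime.strptime(stamp, TIMESTAMP_FORMAT), activity))
    return {
        case_id: [activity for _, activity in sorted(rows, key=lambda row: row[0])]
        for case_id, rows in rows_by_case.items()
    }


def directly_follows(traces):
    """Count how often activity b occurs immediately after activity a within a case."""
    counts = Counter()
    for trace in traces.values():
        counts.update(zip(trace, trace[1:]))
    return counts


traces = build_traces(EVENTS)
for case_id, trace in sorted(traces.items()):
    print(case_id, " -> ".join(trace))

for (first, second), count in sorted(directly_follows(traces).items()):
    print(f"{first} -> {second}: {count}")
```

Pairs that occur in both cases (such as Make order form followed by Scan invoice) get a count of 2, while the branches after Accepted (Decentral rejection versus Decentral approval) each get a count of 1, which is the kind of branching a discovered process model makes visible.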
