Data Mining: Emerging Trends, Challenges and Applications

DATA SCIENCE APPLICATIONS
@DATAMININGAPPS
Prof. dr. Bart Baesens
Dr. Seppe vanden Broucke
Department of Decision Sciences and Information Management,
KU Leuven (Belgium)
School of Management, University of Southampton (United Kingdom)
{Bart.Baesens;Seppe.vandenBroucke}@kuleuven.be
Twitter/Facebook/Youtube: DataMiningApps
www.dataminingapps.com
The Analytics Process Model
1. Identify Business Problem
2. Identify Data Sources
3. Select the Data
4. Clean the Data
5. Transform the Data
6. Analyze the Data
7. Interpret, Evaluate, and Deploy the Model
Phase labels in the figure: Preprocessing, Analytics, Postprocessing
Baesens (2014), Analytics in a big data world: The essential guide to data
science and its applications
Team members
• Database/Datawarehouse administrator
• Business expert (e.g. marketeer, credit risk analyst, …)
• Legal expert
• Data scientist/data miner
• Software/tool vendors
A multidisciplinary team needs to be set up!
Data Scientist
• A data scientist should be a good programmer!
• A data scientist should have solid quantitative skills!
• A data scientist should excel in communication and
visualization skills!
• A data scientist should have a solid business
understanding!
• A data scientist should be creative!
Baesens, Weber, Bravo, vanden Broucke (2015), Hiring Data Scientists: what to look for!
Analytics
• Term often used interchangeably with data mining, knowledge
discovery, predictive/descriptive modeling, …
• Essentially refers to extracting useful business patterns and/or
mathematical decision models from a preprocessed data set
• Predictive analytics
– Predict the future based on patterns learnt from past data
– Classification (churn, response) versus regression (CLV)
• Descriptive analytics
– Describe patterns in data
– Clustering, Association rules, Sequence rules
Analytic Model requirements
• Business relevance
– Solve a particular business problem
• Statistical performance
– Statistical significance of model
– Statistical prediction performance
• Interpretability + Justifiability
– Very subjective (depends on decision maker), but CRUCIAL!
– Often need to be balanced against statistical performance
• Operational efficiency
– How can the analytical models be integrated with campaign management?
• Economical cost
– What is the cost to gather the model inputs and evaluate the model?
– Is it worthwhile buying external data and/or models?
• Regulatory compliance
– In accordance with regulation and legislation
Baesens et al (2003), Using neural network rule extraction and decision tables
for credit-risk evaluation
Verbraken, Verbeke, Baesens (2013), A novel profit maximizing metric for measuring
classification performance of customer churn prediction models
Post processing
• Interpretation and validation of analytical models by business experts
– Trivial versus unexpected (interesting?) patterns
• Sensitivity analysis
– How sensitive is the model with respect to sample characteristics, assumptions, etc.?
• Deploy analytical model into business setting
– Represent model output in a user-friendly way
– Integrate with campaign management tools and marketing decision engines
• Model monitoring and backtesting
– Continuously monitor model output
– Contrast model output with observed numbers
Castermans, Martens, Van Gestel, Hamers, Baesens (2010),
An overview and framework for PD backtesting and benchmarking
Applications
• Credit scoring
• Market basket analysis/recommender systems
• Retention modeling/churn prediction
• Response modeling
• On-line Analytics
• Social Media Analytics
• Social Network Analytics
• Fraud Analytics
• HR Analytics
• Process Analytics
Credit Scoring
• Estimate probability of default at the time the applicant applies for the
loan!
• Use predetermined definition of default (e.g. 3 months of payment arrears)
• Use application variables
– E.g. age, income, marital status, years at address, years with employer, …
• Use bureau variables
– Bureau score, raw bureau data (e.g. number of credit checks, total amount of
credits, delinquency history, …)
– In the US: FICO scores range from 300 to 850
– Experian, Equifax, TransUnion
– E.g., Baycorp Advantage (Australia & New Zealand), Schufa (Germany), BKR (the
Netherlands), CKP (Belgium), Dun & Bradstreet
Van Gestel, Baesens (2009), Credit Risk Management: Basic Concepts.
Example Credit Scorecard
Characteristic Name   Attribute    Scorecard Points
AGE 1                 Up to 26     100
AGE 2                 26 - 35      120
AGE 3                 35 - 37      185
AGE 4                 37+          225
GENDER 1              Male         90
GENDER 2              Female       180
SALARY 1              Up to 500    120
SALARY 2              501-1000     140
SALARY 3              1001-1500    160
SALARY 4              1501-2000    200
SALARY 5              2001+        240
Let cut-off = 500
So, a new customer applies for credit:
AGE       32        120 points
GENDER    Female    180 points
SALARY    $1,150    160 points
Total               460 points
460 points < 500: REFUSE CREDIT
Baesens et al (2003), Benchmarking state-of-the-art classification algorithms for credit scoring
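The scorecard lookup above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the treatment of the overlapping age boundaries (each upper bound taken as inclusive) is an assumption, and so are the "ACCEPT CREDIT" label and the choice to accept a total exactly equal to the cut-off, since the slide only shows a refusal.

```python
# Points copied from the example scorecard.
# Assumption: each bin's upper bound is inclusive (the slide's age bins overlap).
AGE_BINS = [(26, 100), (35, 120), (37, 185), (float("inf"), 225)]
SALARY_BINS = [(500, 120), (1000, 140), (1500, 160), (2000, 200), (float("inf"), 240)]
GENDER_POINTS = {"Male": 90, "Female": 180}
CUTOFF = 500


def bin_points(value, bins):
    """Return the points of the first bin whose upper bound covers value."""
    for upper, points in bins:
        if value <= upper:
            return points
    raise ValueError(f"no bin for value {value!r}")


def score_applicant(age, gender, salary):
    """Sum the points of the three scorecard characteristics."""
    return (
        bin_points(age, AGE_BINS)
        + GENDER_POINTS[gender]
        + bin_points(salary, SALARY_BINS)
    )


def decide(total, cutoff=CUTOFF):
    """Assumed rule: accept when the total reaches the cut-off, else refuse."""
    return "ACCEPT CREDIT" if total >= cutoff else "REFUSE CREDIT"


# The applicant from the slide: 32 years old, female, salary $1,150.
total = score_applicant(age=32, gender="Female", salary=1150)
decision = decide(total)
print(total, decision)  # 460 REFUSE CREDIT
```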
Association rules
• Purpose
– Detect frequently occurring patterns between items
• How?
– Unsupervised data mining (no real target to optimise)
– Deriving association rules
• Example Applications
– Which products/services are frequently bought together?
– Which web pages are frequently visited together?
– Which terms often co-occur in a text document?
Association rules
• Notation:
– $D$: database of transactions $t_p$
– each transaction $t_p$ consists of a transaction ID and a set of items $\{i_1, i_2, \ldots, i_n\}$ selected from all possible items $I$
• An association rule is an implication of the form:
$X \Rightarrow Y$ where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$
$X$: rule antecedent, $Y$: rule consequent
• Example:
– If a customer has a car loan and car insurance, then the customer has a
checking account in 80% of the cases
– If a customer buys spaghetti, then the customer buys red wine in 70% of the
cases
Example transactions database
Transaction   Items
1             stella, hoegaarden, diapers, baby food
2             coke, stella, diapers
3             cigarettes, diapers, baby food
4             chocolates, diapers, hoegaarden, apples
5             tomatoes, water, leffe, stella
6             spaghetti, diapers, baby food, stella
7             water, stella, baby food
8             diapers, baby food, spaghetti
9             baby food, stella, diapers, hoegaarden
10            apples, chimay, baby food
Association Rules: Support and Confidence
• Support of an itemset is the percentage of total transactions in the
database that contain the itemset.
• The rule $X \Rightarrow Y$ has support $s$ if $100s\%$ of the transactions in $D$ contain
$X \cup Y$.
$$\text{support}(X \cup Y) = \frac{\text{number of transactions supporting } X \cup Y}{\text{total number of transactions}}$$
• A frequent itemset is an itemset for which the support is higher than a
prespecified threshold (minsup).
• The rule $X \Rightarrow Y$ has confidence $c$ if $100c\%$ of the transactions in $D$ that
contain $X$ also contain $Y$.
$$\text{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{\text{support}(X \cup Y)}{\text{support}(X)}$$
Associations: Support and Confidence
(Same example transactions database as above.)
E.g. itemset {baby food, diapers, stella} has support = 3/10 or 30%
Association rule: baby food, diapers $\Rightarrow$ stella has confidence of 3/5 or 60%
(5 transactions contain baby food and diapers; 3 of those also contain stella)
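The two numbers above can be checked with a short Python sketch over the ten example transactions. The function names are hypothetical, and confidence is computed from raw counts rather than from rounded support values.

```python
# Transactions copied from the example database.
TRANSACTIONS = [
    {"stella", "hoegaarden", "diapers", "baby food"},
    {"coke", "stella", "diapers"},
    {"cigarettes", "diapers", "baby food"},
    {"chocolates", "diapers", "hoegaarden", "apples"},
    {"tomatoes", "water", "leffe", "stella"},
    {"spaghetti", "diapers", "baby food", "stella"},
    {"water", "stella", "baby food"},
    {"diapers", "baby food", "spaghetti"},
    {"baby food", "stella", "diapers", "hoegaarden"},
    {"apples", "chimay", "baby food"},
]


def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Fraction of transactions containing X that also contain Y."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


s = support({"baby food", "diapers", "stella"})
c = confidence({"baby food", "diapers"}, {"stella"})
print(f"support = {s:.0%}, confidence = {c:.0%}")  # support = 30%, confidence = 60%
```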
Association rule discovery
• Often lots of association rules will be discovered
• Post-processing is a necessity
• Perform sensitivity analysis using minsup and minconf
thresholds
• Trivial rules, e.g., buy spaghetti and spaghetti sauce
• Unexpected/Unknown rules
• Novel and actionable patterns, potentially interesting!
• Appropriate visualisation facilities are crucial!
Baesens et al. (2000), Post-processing of association rules
Market Basket Analysis
baby food, diapers $\Rightarrow$ stella
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package baby food, diapers and stella.
4. Package baby food, diapers and stella + poorly selling item.
5. Raise the price on one, and lower it on the other.
6. Do not advertise baby food, diapers and stella together.
Recommender Systems
• Help people make decisions by giving them recommendations.
• Recommendations are based on preferences of individuals/groups.
• Examples
– In e-Business, recommend items.
– In e-Learning, recommend content.
– In search and navigation, recommend links.
• Netflix competition
– Predict whether someone will enjoy a movie
based on how much they liked or disliked
other movies.
• Amazon, eBay, …
Seret, Verbraken, Versailles, Baesens (2012), A new SOM-based method for profile
generation: Theory and an application in direct marketing
Example: Recommender Systems
Retention modeling/Churn prediction
• Understanding why customers leave you
• Customer Retention is important because long term loyal
customers are less price sensitive, cost less to serve and have a
higher lifetime value
• Small improvements in customer retention generate significant
returns.
• Very important in Telco sector (about 2% monthly churn rate)
• Transaction versus Relationship buyers
• Transaction buyers: buy because of low price
• Relationship buyers: want to build loyal relationship with firm
Glady, Baesens, Croux (2009), Modeling churn using customer lifetime value
Defining churn
• Contractual versus non-contractual setting
– Contractual setting: customer cancels contract (e.g. postpaid Telco)
– Non-contractual setting: customer hasn’t purchased any products or
services during previous 3 months (e.g. online retailer)
• Types of churn
– Active: customer stops relationship with firm
– Passive: customer decreases intensity of relationship
– Forced: company stops relationship because of e.g. fraud
– Expected: customer no longer needs product/service (e.g. baby products)
Baesens et al. (2002), Bayesian neural network learning for repeat purchase modelling in
direct marketing
Churn prediction: types of predictors
• Demographic data
– E.g., age, gender, marital status
• Relationship variables
– E.g., length of relationship, number of products purchased, …
• Product/service usage data
– E.g., number of transactions in previous month, trend in usage, …
• Complaints data
– Number of filed complaints, service desk contacted, …
• RFM data
• (Social) network information (see below)
RFM Framework
• Popular since Cullinan (1977)
• Recency: number of months since last purchase
• Frequency: number of purchases within a given time frame
• Monetary: dollar value of purchases
• Different operationalisations of RFM variables
– E.g., Monetary: average/maximum/total dollar value?
– Trend variables
• Can only be measured for existing customers, not for prospects
(e.g. response modeling)
• Often used to build a segmentation scheme or combine into a
single RFM score
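As an illustration of the definitions above, the following Python sketch derives RFM variables from a purchase history. The data, the function names, the 12-month window and the use of whole calendar months are all assumptions made for the example; it also returns the three monetary operationalisations side by side.

```python
from datetime import date

# Hypothetical purchase history: (customer id, purchase date, amount in dollars).
PURCHASES = [
    ("c1", date(2015, 1, 10), 50.0),
    ("c1", date(2015, 5, 2), 30.0),
    ("c2", date(2014, 11, 20), 200.0),
    ("c1", date(2015, 6, 15), 40.0),
]


def months_between(earlier, later):
    """Whole calendar months between two dates (days are ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def rfm(purchases, reference_date, window_months=12):
    """Compute RFM variables per customer over an assumed time window."""
    by_customer = {}
    for customer, day, amount in purchases:
        in_window = (
            day <= reference_date
            and months_between(day, reference_date) <= window_months
        )
        if in_window:
            by_customer.setdefault(customer, []).append((day, amount))

    result = {}
    for customer, rows in by_customer.items():
        amounts = [amount for _, amount in rows]
        last_purchase = max(day for day, _ in rows)
        result[customer] = {
            "recency_months": months_between(last_purchase, reference_date),
            "frequency": len(rows),
            "monetary_total": sum(amounts),
            "monetary_avg": sum(amounts) / len(amounts),
            "monetary_max": max(amounts),
        }
    return result


scores = rfm(PURCHASES, reference_date=date(2015, 7, 1))
for customer, values in sorted(scores.items()):
    print(customer, values)
```

Note that a customer without any purchase in the window simply does not appear in the result, which mirrors the point that RFM can only be measured for existing customers.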
Response modeling
• Customer acquisition: acquiring new customers with
targeted campaigns, win-back campaigns
• Campaign can be mail catalogue, email, coupon, A/B
or multivariate testing, …
• Identify the customers most likely to respond based
on the following information:
– Demographic variables (age, gender, marital status, …)
– Relationship variables (length of relationship, number of
products purchased, …)
– RFM variables
– (Social) network information (see below)
Response modeling setup
• Split target group into test group and control group
• Test group receives marketing material and control
group does not
• Incremental impact equals the additional
purchases that are directly attributable to the
campaign (Larsen, 2010)
• Incremental impact = test group purchase rate − control group purchase rate
Baesens (2014), Analytics in a big data world: The essential guide to data
science and its applications
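A small worked example of the incremental impact calculation, written as a Python sketch. The campaign counts are invented for illustration and the function names are hypothetical.

```python
def purchase_rate(purchases, group_size):
    """Share of a group that made a purchase."""
    return purchases / group_size


def incremental_impact(test_purchases, test_size, control_purchases, control_size):
    """Test group purchase rate minus control group purchase rate."""
    return purchase_rate(test_purchases, test_size) - purchase_rate(
        control_purchases, control_size
    )


# Hypothetical campaign: 1,000 customers per group,
# 120 purchases in the test group and 80 in the control group.
impact = incremental_impact(120, 1000, 80, 1000)
print(f"incremental impact: {impact:.1%}")  # incremental impact: 4.0%
```

In this made-up case, 8 of the 12 percentage points would have been bought anyway (self-selecting clients), so only 4 points are attributable to the campaign.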
Measuring incremental impact (Larsen, 2010)
• Try to factor in the behavior of self-selecting clients,
clients that purchase regardless of the marketing
campaign
• Focus should be on swing clients: interested in the
product, but need to be motivated (by e.g. marketing
message) to take action
• Both test and control group should be representative
• Find a model such that the difference between the test
group purchase rate and the control group purchase rate
is maximized (i.e. identifying the swing clients)
Gross versus Net Lift Models (Lo, 2002)
[Figure: gross lift and net lift modeling setups. Labels: Previous campaign data, Test, Control, Training data, Holdout data, Model.]
Net Lift models (Larsen, 2010)
Test group
– Self-selectors: Y=1
– Converted swing clients: Y=1
– No purchase: Y=0
Control group
– Self-selectors: Y=1
– Swing clients: Y=0
– No purchase: Y=0
Building a Difference Score Model
• Step 1: Build a logistic regression model estimating
probability of purchase given treatment, P(purchase|test)
• Step 2: Build a logistic regression model estimating
probability of purchase given control, P(purchase|control)
• Step 3: Incremental score = P(purchase|test) − P(purchase|control)
Note: to understand the impact of the predictors,
regress the incremental lift scores on the original data!
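The three steps can be sketched in plain Python as follows. Everything here is an assumption for illustration: the data are synthetic with a single binary predictor, the logistic regression is fitted with a hand-written batch gradient descent (a statistics package would normally be used), and the helper names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    """P(purchase) for predictors x under weights [intercept, w_1, ..., w_k]."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a logistic regression by batch gradient descent on the log-loss."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def make_group(rate_if_1, rate_if_0, size=10):
    """Synthetic group: one binary predictor, `size` customers per value."""
    X, y = [], []
    for x_value, rate in ((1, rate_if_1), (0, rate_if_0)):
        buyers = round(rate * size)
        X += [[x_value]] * size
        y += [1] * buyers + [0] * (size - buyers)
    return X, y


# Step 1: model on the test (treated) group.
X_test, y_test = make_group(rate_if_1=0.8, rate_if_0=0.2)
w_test = fit_logistic(X_test, y_test)

# Step 2: model on the control group.
X_ctrl, y_ctrl = make_group(rate_if_1=0.2, rate_if_0=0.2)
w_ctrl = fit_logistic(X_ctrl, y_ctrl)


# Step 3: incremental score = P(purchase|test) - P(purchase|control).
def incremental_score(x):
    return predict(w_test, x) - predict(w_ctrl, x)


score_x1 = incremental_score([1])
score_x0 = incremental_score([0])
print(round(score_x1, 2), round(score_x0, 2))
```

In this synthetic setup, customers with predictor value 1 behave like swing clients (they buy far more often when treated), so they receive the higher incremental score.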
On-line analytics: example questions
• How do customers find my website (Google,
Facebook, …)?
• How to optimise my on-line marketing mix (e.g.
Google SEO versus Google Adwords)?
• Where am I sending customers to?
• What is the average time customers spend on my
website?
• How can I customise the on-line experience?
• How to measure customer engagement?
• …
On-line analytics: data collection
• Web server logs (server side)
195.162.218.155 - - [27/Jun/2002:00:01:54 +0200]
"GET /dutch/shop/detail.html HTTP/1.1" 200 38890
"http://www.msn.be/shopping/food/" "Mozilla/4.0 (MSIE 6.0)"
• Page tagging (client side)
– “tagging” web page with a code snippet referencing a
separate JavaScript file
• Cookies
– small text string that a Web server can send to a visitor's
Web browser (as part of its HTTP response)
– privacy! (cf. regulatory compliance)
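To show what server-side data looks like once parsed, here is a minimal Python sketch that splits the example log entry into fields. The regular expression assumes the layout of that single example (host, two placeholder fields, timestamp, request, status, size, referrer, user agent); the field names are chosen for this illustration, and real logs may need a more tolerant pattern.

```python
import re
from datetime import datetime

# The example entry from the slide, joined into one line.
LOG_LINE = (
    '195.162.218.155 - - [27/Jun/2002:00:01:54 +0200] '
    '"GET /dutch/shop/detail.html HTTP/1.1" 200 38890 '
    '"http://www.msn.be/shopping/food/" "Mozilla/4.0 (MSIE 6.0)"'
)

# Assumed layout, derived from the example entry only.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Split one log entry into a dict of typed fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not match the expected log layout")
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = None if entry["size"] == "-" else int(entry["size"])
    return entry


entry = parse_log_line(LOG_LINE)
print(entry["host"], entry["path"], entry["status"], entry["referrer"])
```

The referrer field is what answers "How do customers find my website?", while the requested path and timestamp feed page-level and session-level metrics.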
KPI monitoring using dashboards
On-line Analytics: challenges
• Extremely messy data
– Extensive preprocessing needed
– Focus on trends + segmentation
• Information overload: too many metrics!
• Focus on actionable metrics
– Bounce rate: ratio of visits where visitor left instantly
– Conversion rate: percentage of visitors for which we observed the
event (e.g. purchase, pdf download, registration, …)
• Integrate on-line with off-line customer data!
Huysmans, Mues, Vanthienen, Baesens (2004),
Web Usage Mining with Time Constrained Association Rules.
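The two actionable metrics defined above can be computed as in the following Python sketch. The visit records are invented, and two simplifications are assumptions of this example: a bounce is approximated as a visit with a single page view, and conversion is counted per distinct visitor.

```python
# Hypothetical visit records.
VISITS = [
    {"visitor": "v1", "pages_viewed": 1, "converted": False},
    {"visitor": "v2", "pages_viewed": 5, "converted": True},
    {"visitor": "v3", "pages_viewed": 1, "converted": False},
    {"visitor": "v4", "pages_viewed": 3, "converted": False},
]


def bounce_rate(visits):
    """Ratio of visits where the visitor left after a single page view."""
    bounces = sum(1 for v in visits if v["pages_viewed"] == 1)
    return bounces / len(visits)


def conversion_rate(visits):
    """Percentage of distinct visitors for which the event was observed."""
    visitors = {v["visitor"] for v in visits}
    converted = {v["visitor"] for v in visits if v["converted"]}
    return 100.0 * len(converted) / len(visitors)


print(bounce_rate(VISITS), conversion_rate(VISITS))  # 0.5 25.0
```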
Social Media Analytics
• Analyse on-line social media data (e.g. Twitter feeds, Facebook
messages, …)
• Applications
– Corporate reputation and sentiment analysis
– Identification of key themes, opinions and trending topics
– Social graphing and viral tracking
• Develop a social CRM strategy!
Social Network Analytics
• Networked data
• Telephone calls
• Facebook, Twitter, LinkedIn, …
• Web pages connected by
hyperlinks
• Research papers connected by
citations
• Terrorism networks
• Applications
• Product recommendations
• Churn detection
• Web page classification
• Fraud detection
• Terrorism detection
Baesens (2014), Analytics in a big data world: The essential
guide to data science and its applications
Example: Social Networks in a Telco context
• Traditional churn prediction models treat customers as
isolated entities
• However, customers are strongly influenced by their social
environment:
– recommendations from peers, word-of-mouth publicity
– social leader influence
– promotional offers from operators to
acquire groups of friends
– reduced tariffs for intra-operator traffic
⇒ Take into account the customers’ social network!
Verbeke, Martens, Baesens (2014), Social network analysis
for customer churn prediction
Fraud Analytics
• Fraud is an uncommon, well-considered, imperceptibly
concealed, time-evolving and often carefully organized crime
which appears in many types and forms.
Baesens, Van Vlasselaer, Verbeke (2015), Fraud Analytics using Descriptive,
Predictive and Social Network Techniques.
Fraud Analytics
Credit card transaction fraud:
• Stolen credit cards (yellow nodes) are often used in the same stores (blue nodes)
• Store itself also processes legitimate transactions to cover their fraudulent activities
Fraud Analytics
Identity theft:
• Before: person calls his/her frequent contacts
• After: person also calls new contacts which coincidentally overlap with another person's contacts.
Fraud Analytics
Social security fraud:
• Companies are frequently associated with other companies that perpetrate
suspicious/fraudulent activities.
Van Vlasselaer, Eliassi-Rad, Akoglu, Snoeck, Baesens (2016),
Gotcha! Network-based fraud detection for social security fraud
HR analytics
• Employee churn
• Employee performance
• Employee absence
• Employee satisfaction
• Employee Lifetime Value
• …
Example Absenteeism scorecard
Characteristic Name   Attribute     Points
Age                   Up to 26      100
Age                   26-35         120
Age                   35-37         185
Age                   37+           225
Function              No-manager    90
Function              Manager       180
Department            HR            120
Department            Marketing     140
Department            Finance       160
Department            Production    200
Department            IT            240
Let cutoff = 500
So, a new employee needs to be scored:
Age           32         120 points
Function      Manager    180 points
Department    Finance    160 points
Total                    460 points
No Absenteeism!
Baesens (2014), 5 Reasons to Start with Predictive Employee Turnover Analytics.
Hiring & Firing
Baesens, De Winne, Sels,
What to Do Before You Fire a
Pivotal Employee, 2016
Process Analytics
• Extracting knowledge from event logs of information systems
– Control flow perspective
– Organizational perspective
– Information perspective
De Weerdt, De Backer, Vanthienen, Baesens (2012), A multi-dimensional quality assessment of state-of-the-art process
discovery algorithms using real-life event logs
Process Analytics: Example
Case ID   Activity Name          Originator   Timestamp             Extra Data
001       Make order form        Mary         20-07-2010 14:02:06   …
002       Make order form        Jane         20-07-2010 15:45:29   …
001       Scan invoice           John         10-08-2010 09:52:31   …
001       Central registration   John         10-08-2010 10:00:36   …
002       Scan invoice           John         11-08-2010 09:15:22   …
002       Central registration   John         11-08-2010 09:20:01   …
001       Accepted               Sophie       13-08-2010 08:20:54   …
002       Accepted               Sophie       13-08-2010 08:21:12   …
001       Decentral rejection    Mary         14-08-2010 14:15:14   …
001       End                    System       14-08-2010 14:15:15   …
002       Decentral approval     Jane         16-08-2010 19:22:56   …
002       Invoice booked         System       16-08-2010 19:22:59   …
002       End                    System       16-08-2010 19:23:00   …
003       Make order form        Mary         19-08-2010 07:52:41   …
004       Make order form        Mary         19-08-2010 15:21:39   …
[Figure: process model discovered from the event log. Miners named in the figure: Genetic Miner, HeuristicsMiner, AGNEs Miner. Activities shown: Make order form, Scan invoice, Central registration, Rejected, Decentral revision, Accepted, Decentral approval, Invoice booked, Decentral rejection, End.]
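As a first step toward the control flow perspective, the event log above can be grouped into one ordered trace per case, from which "directly follows" pairs are counted. The Python sketch below is only an illustration with hypothetical names; it is not the Genetic Miner, HeuristicsMiner or AGNEs algorithm, and it leaves out the elided "Extra Data" column.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Rows copied from the example event log:
# (case id, activity name, originator, timestamp).
EVENT_LOG = [
    ("001", "Make order form", "Mary", "20-07-2010 14:02:06"),
    ("002", "Make order form", "Jane", "20-07-2010 15:45:29"),
    ("001", "Scan invoice", "John", "10-08-2010 09:52:31"),
    ("001", "Central registration", "John", "10-08-2010 10:00:36"),
    ("002", "Scan invoice", "John", "11-08-2010 09:15:22"),
    ("002", "Central registration", "John", "11-08-2010 09:20:01"),
    ("001", "Accepted", "Sophie", "13-08-2010 08:20:54"),
    ("002", "Accepted", "Sophie", "13-08-2010 08:21:12"),
    ("001", "Decentral rejection", "Mary", "14-08-2010 14:15:14"),
    ("001", "End", "System", "14-08-2010 14:15:15"),
    ("002", "Decentral approval", "Jane", "16-08-2010 19:22:56"),
    ("002", "Invoice booked", "System", "16-08-2010 19:22:59"),
    ("002", "End", "System", "16-08-2010 19:23:00"),
    ("003", "Make order form", "Mary", "19-08-2010 07:52:41"),
    ("004", "Make order form", "Mary", "19-08-2010 15:21:39"),
]


def build_traces(event_log):
    """Group events per case and order each case by timestamp."""
    events = defaultdict(list)
    for case_id, activity, _originator, stamp in event_log:
        when = datetime.strptime(stamp, "%d-%m-%Y %H:%M:%S")
        events[case_id].append((when, activity))
    return {
        case_id: [activity for _, activity in sorted(rows)]
        for case_id, rows in events.items()
    }


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for trace in traces.values():
        counts.update(zip(trace, trace[1:]))
    return counts


traces = build_traces(EVENT_LOG)
follows = directly_follows(traces)
print(traces["001"])
print(follows[("Make order form", "Scan invoice")])  # 2
```

Grouping the same rows by originator instead of by case would give a starting point for the organizational perspective.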