31 January 2005
INTERNAL
ATO – Various
4 June 2009
SEGMENT
AUDIENCE
DATE
Analytics: Data Mining for Risk and Compliance
Name of Presenter
Title of Presenter
Analytics, Office of the Chief Knowledge Officer
Version 1.0
Overview
 Analytics and the Data Mining Process
 Exploring Data
 Supervised Modelling
 Unsupervised Modelling
 Data Matching
 Analytics Project Achievements
Analytics and the Data Mining Process
The Shape and Form of a
Data Mining Project
Analytics
Sits under the Office of the Chief Knowledge Officer and is part of the EST sub-plan.
Established as a national capability in 2003.
The team has been built up to 19 data mining specialists, representing the largest data mining team in Australia.
Works with up to 60 analysts throughout the organisation to spread the new technology and provide an overarching framework for Risk Management for the ATO.
The national team works closely with Business Lines both to deliver new risk models and to transfer skills and technology.
The Analytics Community of Practice meets weekly to share experiences and technology, and to peer review modelling across the ATO.
Analytics Functions
Deploy data mining
 Working with business lines to deliver new risk models
 Improved strike rates and more efficient usage of limited resources
Analytics Community of Practice
 Weekly meetings and emailing lists to share experiences and to introduce new technologies
AnalyticsNet Infrastructure
 New 64-bit hardware to allow our large datasets to be analysed in memory (32 GB memory)
 Sharing of new tools and technology
Analytics Training
 Beginning a series of courses introducing data mining
 A hands-on approach – kick start with own data
Analytics and Traditional Modelling
 Analytics brings a different, but complementary and advanced, approach to modelling and predicting client behaviour.
 Traditional modelling approaches explore client data and couple this with an understanding of financial processes to build mathematical models that simulate these processes, and then identify non-compliance with the models.
 Analytics, using data mining technology, supplements traditional modelling approaches by modelling from the data: using powerful tools to automatically search for interesting, unusual or unexpected patterns that indicate non-compliance. This is a data driven approach.
 Data driven approach
 Crucial to have the right data
- Clean
- Relevant
- Before the event (recorded before the outcome being predicted)
 A data mining project is a joint process between the business experts and data miners
- business problem
- business processes
- data
CRISP-DM
The Data Mining Process
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modelling
5. Evaluation
6. Deployment
Source: http://www.crisp-dm.org/Process/index.htm
Applying results of data mining…
Degree of Sophistication (steps 1 to 4):
1. Apply New Risk Segmentation: Instead of using $ value or market segment as proxy for risk, identify actual group and its characteristics. Create new language and awareness of risk.
2. Tune Screening Rules: Adjust screening rules (thresholds, ratios, exceptions) to reflect better understanding of risk. Look at adjusting, combining rules. Can be applied straight away.
3. Optimise a Treatment Strategy: Find the optimal point to maximise revenue collection, while minimising caseload and occurrence of fraud. Apply risk scores to case selection to get best overall outcomes.
4. Optimise Treatment Portfolio: Find the optimal point to maximise revenue collection, while minimising caseload and occurrence of fraud, for the whole of treatment portfolio. Optimise the treatment mix.
Optimisation is more than picking the right clients: the right treatment and right work mix also need to be optimised.
Client Scoring for treatment selection…
So we can personalise our treatment
strategies to the client
Decision Tree of Rules derived from data to assign scores
[Figure: clients placed on a Score scale from 0 to 1000 for treatment selection. Treatments shown: Letter X, Letter Y, Call, Treatment – Review, Treatment – Audit. Models shown: Decision Tree, Neural Net, Rule Induction, Regression, DM Neural.]
In fact scores are likely to be done via several models ‘voting’ together – Ensembles.
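The idea of several models ‘voting’ together can be sketched as follows. This is a minimal illustration only: the two scoring functions, the client fields (`late_lodgments`, `debt`) and the score bands that map to each treatment are hypothetical and are not taken from the ATO models.

```python
from statistics import mean


def clamp(score, low=0.0, high=1000.0):
    """Keep a score on the 0 to 1000 scale."""
    return max(low, min(high, score))


def rule_score(client):
    """Hypothetical rule-based scorer (stands in for a decision tree)."""
    score = 200.0
    if client["late_lodgments"] >= 2:
        score += 400.0
    if client["debt"] > 10_000:
        score += 300.0
    return clamp(score)


def linear_score(client):
    """Hypothetical linear scorer (stands in for a regression model)."""
    return clamp(100.0 + 150.0 * client["late_lodgments"] + 0.02 * client["debt"])


def ensemble_score(client, models):
    """Each model 'votes' with its own score; the ensemble takes the average."""
    return mean(model(client) for model in models)


def select_treatment(score):
    """Map a score to a treatment. The bands are assumptions for illustration."""
    if score >= 800:
        return "audit"
    if score >= 500:
        return "review"
    if score >= 300:
        return "call"
    return "letter"


MODELS = [rule_score, linear_score]
client = {"late_lodgments": 3, "debt": 15_000}
score = ensemble_score(client, MODELS)
treatment = select_treatment(score)
```

Averaging is only one way to combine votes; a weighted average or a majority vote over score bands would fit the same structure.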
Moving Forward with Analytics
The low hanging fruit for Data Mining is the large collection of
outcomes from audit activity – this has been a primary focus in the first
instance.
It is a more difficult data mining task to identify emerging risks, but
technology for identifying emergent patterns is becoming available.
Text mining and social network analysis will significantly enhance our
Intelligence and Risk Modelling capabilities.
Deployment of Analytics through Operational Analytics
 How best to deploy Analytics Models – new territory
 Translate models to SQL or leave in native language (R, SAS, Java)?
 Computational requirements of SQL over the Data Warehouse
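One of the options raised above is translating a model to SQL so that it can run over the Data Warehouse. As a hedged sketch, a small decision tree can be rendered as a nested SQL CASE expression. The tree layout (nested dictionaries), the table `clients`, the columns `client_id`, `debt` and `late_lodgments`, and the scores are all hypothetical.

```python
def tree_to_sql(node):
    """Render a decision tree (nested dicts) as a SQL CASE expression."""
    if "score" in node:  # leaf node: emit the score itself
        return str(node["score"])
    condition = f"{node['column']} {node['op']} {node['value']}"
    return (
        f"CASE WHEN {condition} THEN {tree_to_sql(node['yes'])} "
        f"ELSE {tree_to_sql(node['no'])} END"
    )


# Hypothetical two-level tree; a real tree would come from the trained model.
tree = {
    "column": "debt", "op": ">", "value": 10000,
    "yes": {
        "column": "late_lodgments", "op": ">=", "value": 2,
        "yes": {"score": 900},
        "no": {"score": 600},
    },
    "no": {"score": 200},
}

sql = f"SELECT client_id, {tree_to_sql(tree)} AS risk_score FROM clients"
```

Deeper trees produce deeply nested CASE expressions, which is one source of the computational cost noted above.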
Supervised Modelling
Working From What We Know
To Build Models
To Automate “Case Selection”
Supervised modelling
 predict some value or outcome having seen a number of
training examples
- training data will have a ‘target’ variable
- prediction can be a continuous variable, or a class
 model ‘learns’ from training data, and is tested on
‘unseen’ cases
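The train-then-test cycle can be illustrated with a very small classifier. This is a sketch under stated assumptions: the data are synthetic (one feature `x`, with the target set to 1 when `x` is at least 60), and `train_stump` is a hypothetical one-threshold learner, far simpler than the models used in practice.

```python
import random


def train_stump(rows):
    """Learn the threshold on x that classifies the most training rows correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted({x for x, _ in rows}):
        correct = sum((x >= candidate) == bool(y) for x, y in rows)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def predict(threshold, x):
    """Predict the class (0 or 1) for a single case."""
    return int(x >= threshold)


def accuracy(threshold, rows):
    """Fraction of rows whose predicted class matches the target."""
    return sum(predict(threshold, x) == y for x, y in rows) / len(rows)


rng = random.Random(0)
# Synthetic cases: (feature, target). The target is the 'known outcome'.
data = [(x, int(x >= 60)) for x in (rng.uniform(0, 100) for _ in range(200))]
rng.shuffle(data)

train, test = data[:150], data[150:]    # hold back 'unseen' cases
threshold = train_stump(train)          # the model 'learns' from training data
test_accuracy = accuracy(threshold, test)  # and is tested on unseen cases
```

Accuracy on the held-back cases, not on the training cases, is the figure that indicates how the model is likely to perform on new clients.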
Effect of Adding More Data – Data is Fundamental
[Figure: two panels, Base Data and Client History, each plotting Performance (%) against Caseload (%) from 0 to 100 for three measures: Revenue, Recall and Precision. The panels carry the annotations 4% and 7%.]
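The three measures plotted against caseload can be computed as in the sketch below. It assumes that cases are ranked by risk score and the top share is selected, that Recall is the share of all non-compliant cases selected, that Precision is the share of selected cases that are non-compliant, and that Revenue is the share of total recoverable revenue captured. These definitions and the sample figures are assumptions for illustration, not values from the charts.

```python
def performance_at_caseload(cases, caseload_pct):
    """cases: list of (risk_score, is_noncompliant, revenue) tuples."""
    ranked = sorted(cases, key=lambda case: case[0], reverse=True)
    n_selected = round(len(ranked) * caseload_pct / 100)
    selected = ranked[:n_selected]

    hits = sum(1 for _, bad, _ in selected if bad)
    total_bad = sum(1 for _, bad, _ in cases if bad)
    selected_revenue = sum(revenue for _, _, revenue in selected)
    total_revenue = sum(revenue for _, _, revenue in cases)

    return {
        "precision": 100 * hits / n_selected if n_selected else 0.0,
        "recall": 100 * hits / total_bad if total_bad else 0.0,
        "revenue": 100 * selected_revenue / total_revenue if total_revenue else 0.0,
    }


# Hypothetical scored cases: (risk score, non-compliant?, revenue at stake)
cases = [
    (950, True, 5000), (900, True, 3000), (800, False, 0), (700, True, 1500),
    (600, False, 0), (500, False, 0), (400, True, 500), (300, False, 0),
    (200, False, 0), (100, False, 0),
]

at_20 = performance_at_caseload(cases, 20)
at_40 = performance_at_caseload(cases, 40)
```

Sweeping `caseload_pct` from 0 to 100 gives one curve per measure, which is the form of chart shown on this slide.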
New Technologies
 Regression
 Decision Trees
 Random Forests
 Boosted Trees
 Support Vector Machines
 Neural Networks
Unsupervised Modelling
A Data Driven Approach to
Identifying – Exploring – Understanding
Client Groups
Unsupervised modelling
 A class of problems in which one seeks to determine how
the data are organised
 Distinct from supervised modelling in that the data have no
‘target’ variable
 Seek to summarise and explain key features of the data.
Cluster Analysis
 Seeks to identify homogeneous subgroups in a population
 establish groups and then analyse group membership
 discovers structures in data without explaining why they exist
 mostly used when we have no a priori hypotheses and are still in the exploratory phase of our research
 used to classify large amounts of information into manageable meaningful piles
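As one concrete instance of cluster analysis, the sketch below uses k-means. The slides do not name a specific clustering algorithm, so k-means is a swapped-in example, and the six two-dimensional points are made up for illustration.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters; returns (labels, centres)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k randomly chosen points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centres[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centres


# Hypothetical data: two visibly separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centres = kmeans(points, k=2)
```

The algorithm reports which points belong together but, as noted above, it does not explain why the groups exist; that interpretation is left to the analyst.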
 Omitted Income – outlier detection
[Figure: plot with four points labelled as outliers]
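The slide does not say which outlier detection method was used. As a hedged illustration, the sketch below flags values whose modified z-score, based on the median absolute deviation (MAD), exceeds 3.5. The method, the threshold and the sample values are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:  # all values (nearly) identical: nothing stands out
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [v for v in values if 0.6745 * abs(v - centre) / mad > threshold]


# Hypothetical reported incomes (in thousands); one is far from the rest.
incomes = [52, 48, 50, 51, 49, 47, 53, 120]
flagged = mad_outliers(incomes)
```

Using the median rather than the mean keeps the extreme value itself from distorting the baseline it is compared against.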
Self Organising Maps (SOM)
 A self-organising map is a special type of artificial neural network which performs unsupervised competitive learning (Kohonen, 1982)
 Useful for visualising low-dimensional views of high-dimensional data
 Plots the similarities of the data by grouping similar data items together
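The competitive learning step can be sketched in a few lines. This is a minimal self-organising map under stated assumptions: a 3 by 3 grid, a linearly decaying learning rate and neighbourhood radius, a Gaussian neighbourhood, and four made-up two-dimensional data items. The parameter choices are illustrative, not those of any ATO model.

```python
import math
import random


def best_matching_unit(weights, x):
    """Competitive step: the grid cell whose weight vector is closest to x wins."""
    return min(weights, key=lambda cell: math.dist(weights[cell], x))


def train_som(data, rows, cols, epochs=50, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows) for c in range(cols)
    }
    start_radius = max(rows, cols) / 2
    for epoch in range(epochs):
        progress = epoch / epochs
        rate = 0.5 * (1 - progress)                       # learning rate decays
        radius = max(start_radius * (1 - progress), 0.5)  # neighbourhood shrinks
        for x in data:
            winner = best_matching_unit(weights, x)
            for cell, w in weights.items():
                # Cells near the winner on the grid are pulled towards x too.
                grid_distance = math.dist(cell, winner)
                influence = math.exp(-grid_distance ** 2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * influence * (x[i] - w[i])
    return weights


# Hypothetical data: two groups of similar items.
data = [(0.10, 0.10), (0.12, 0.08), (0.90, 0.90), (0.88, 0.92)]
weights = train_som(data, rows=3, cols=3)
cell_low = best_matching_unit(weights, (0.10, 0.10))
cell_high = best_matching_unit(weights, (0.90, 0.90))
```

After training, similar items share a cell or sit in neighbouring cells, so counting items per cell gives the kind of map used to look for hot spots.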
Debt Behaviour - Self Organising Maps
Aim: understand the logic and structures that drive taxpayers’ compliance behaviour (behavioural archetypes).
Construct ‘psychographic groups’ (Wells 1975) by using data mining clusters: each cell in the “map” represents thousands of entities who are similar across many characteristics.
Identify hot spots which indicate high levels of “activity” associated with different characteristics.
6.5 million entities in total population
Text Mining of Complex Documents
 Large collections of documents (unstructured data from
multiple sources including source systems, client hard drives
and scanned material) need to be reviewed
 Task: systematically sift the required information from the
“noise”
 Aim: Reduce the time taken to identify those documents
that support compliance treatment
Associated Entities
Identifying and understanding Associated Entities is important in many different Taxation contexts.
 Debt:
 Linking Associated Entities is important in understanding an entity’s Propensity and Capacity to Pay and then in modelling their debt risk.
 Entities are associated through partnerships, directorships, and consolidated groups where we need to identify the ultimate holding company.
 Lodgment:
 Analyse lodgment behaviour and risk to revenue by knowing relationships between Associated Entities.
 Relationships derived from the linkages could be used for identifying “leverage” points for more effective treatment strategies.
 Tool is in the early stages of development.
[Figure: two network diagrams of associated entities, One Degree of Separation and Two Degrees of Separation. Colours distinguish Companies, Government, Individuals, Partnerships, Superannuation and Trusts. Triangle = non lodged; Circle = lodged; Size = Ind … Large.]
Associated Entities
[Figure: network diagrams of associated entities at One, Two, Three, Four and Five Degrees of Separation. Colours distinguish Companies, Government, Individuals, Partnerships, Superannuation and Trusts. Triangle = non lodged; Circle = lodged; Size = Ind … Large.]
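Degrees of separation between associated entities can be computed with a breadth-first search over the linkage data. The sketch below is illustrative only: the entity names and links are made up, and the function name `degrees_of_separation` is hypothetical rather than part of the tool described.

```python
from collections import deque


def degrees_of_separation(links, start, max_degree):
    """Return {entity: degree} for entities within max_degree links of start."""
    # Build an undirected graph from (entity, entity) association pairs.
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    degree = {start: 0}
    queue = deque([start])
    while queue:
        entity = queue.popleft()
        if degree[entity] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbour in graph.get(entity, ()):
            if neighbour not in degree:
                degree[neighbour] = degree[entity] + 1
                queue.append(neighbour)

    del degree[start]  # report only the associated entities
    return degree


# Hypothetical associations (for example directorships or partnerships).
links = [
    ("Individual A", "Company B"),
    ("Company B", "Trust C"),
    ("Trust C", "Partnership D"),
]
within_two = degrees_of_separation(links, "Individual A", max_degree=2)
```

Raising `max_degree` widens the network in the same way as the move from one to five degrees of separation in the diagrams.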
Data Matching
AUSTRAC
Internal data
Analytics Project Achievements
Application of Data Mining
in the ATO
Data mining at work
Intangible Effect of Data Mining
Other projects
Ceased Business
Failure to Lodge
IT Return Not Necessary
Propensity to Lodge
Risk to Revenue – FBT
Strategy Evaluation and Improvement
In House Prosecutions
Risk to Information
Risk Score Associated Entities
Risk to Reputation
Some New Points of View
Fraud is found at the edge or boundary of pockets of activity, rather than among the outliers.
[Figure: pockets of activity with Outliers and Boundary Cases labelled]
Scattergram (Taylor-Russell Table)
Separating Aberrant from Acceptable Cases
[Figure: scattergram of Aberrant Cases and Acceptable Cases. The Baseline and the Cutoff used by Classifier divide it into four regions: True Positives, False Positives, False Negatives and True Negatives.]
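The four regions of the scattergram can be counted directly once a cutoff is chosen. In this sketch a case is flagged when its score is at or above the cutoff; the scores, outcomes and cutoff values are hypothetical.

```python
def confusion_counts(cases, cutoff):
    """cases: list of (score, is_aberrant). Returns counts for the four regions."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, is_aberrant in cases:
        flagged = score >= cutoff  # the classifier's decision
        if flagged and is_aberrant:
            counts["TP"] += 1      # aberrant case correctly flagged
        elif flagged:
            counts["FP"] += 1      # acceptable case flagged in error
        elif is_aberrant:
            counts["FN"] += 1      # aberrant case missed
        else:
            counts["TN"] += 1      # acceptable case correctly passed
    return counts


# Hypothetical scored cases: (score, truly aberrant?)
cases = [(900, True), (850, False), (700, True), (400, False), (300, True), (100, False)]

at_600 = confusion_counts(cases, cutoff=600)
at_880 = confusion_counts(cases, cutoff=880)
```

Moving the cutoff trades one kind of error for the other: in this example the higher cutoff removes the false positive but misses one more aberrant case.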
Role of Expertise
Need to develop procedures and methods for capturing the knowledge, skills and strategies experts employ to identify non-compliance (the “smell factor” of a case), and to incorporate these as routines and models in our discovery and detection systems.
Examples include the expertise used for
 Risk Identification
 Feature Selection
 Classification of Cases