Download Building In-Database Predictive Scoring Model: Check Fraud

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Building In-Database Predictive Scoring Model:
Check Fraud Detection Case Study
Jay Zhou, Ph.D.
Business Data Miners, LLC
978-726-3182
[email protected]
Web Site: www.businessdataminers.com
LinkedIn: www.linkedin.com/in/jiangzhou
Copyright © Business Data Miners.
1
Building In-Database Predictive Scoring
Model: Check Fraud Detection Case Study
• Four Components
–Business: Check Fraud
–Scoring Model: Logistic Regression
–Modeling Tool: Oracle Data Mining
–Data Storage/Management: Oracle
Database
Copyright © Business Data Miners.
2
Business
Marketing
Fraud
LTV
Churn
Credit
Data Mining/Predictive Analytics Methods
Tree
Logistic
reg
Gradient
Boosting
Regression
SVM
ANN
Clustering
DM/PA Tools
R
SAS
Splus
SPSS
Statsoft
ODM
Data Storage/Management
Oracle
DB2
SQL
Server
MY
SQL
Weka
SAS
File
R File
Copyright © 2009 Business Data Miners.
3
Business Problems
Marketing
Fraud
LTV
Churn
Credit
Data Mining/Predictive Analytics Methods
Tree
Logistic
reg
Gradient
Boosting
Regression
SVM
ANN
Clustering
DM/PA Tools
R
SAS
Splus
SPSS
Statsoft
ODM
Data Storage/Management
Oracle
SQL
Server
DB2
MY
SQL
Copyright © 2009 Business Data Miners.
Weka
SAS
File
R File
4
Business Problems
Marketing
Fraud
LTV
Churn
Credit
Data Mining/Predictive Analytics Methods
Tree
Logistic
reg
Regression
SVM
Anomaly
Clustering
DM/PA Tools/Database
ODM
Oracle 11g
Copyright © 2009 Business Data Miners.
Copyright © Business Data Miners.
5
About Check Deposit
• When a check is deposited, two banks are
involved:
– Depositor’s bank and payer’s bank
– Usually the depositor’s bank makes the funds
available for the depositor to withdraw before the
payer’s bank actually transfers the funds to the
depositor’s bank.
Copyright © Business Data Miners.
6
About Check Deposit Fraud
• Fraudsters take advantage of this “gap”. They
deposit fraudulent checks to inflate their
account balances and then withdraw funds
quickly. The depositor’s bank suffers losses
when fraudulent checks returned unpaid by
the payer’s bank.
Copyright © Business Data Miners.
7
About Rule Based Deposit Fraud
Detection Systems
• Advantages
– Understandable by analysts.
– Make use of analysts’ experience and intuition
• Examples
• Rule 1: # of Check Deposits In Last 7 Days More than 35
• Rule 2: Total Amount of Check Deposits In Last 5 Days
Higher Than $15,000
Copyright © Business Data Miners.
8
Statistical Predictive Model
• Data driven.
• Built based on a large number of historical
cases.
• Advantages
–Handle many variables simultaneously and
normally detect more frauds with lower
false positives. This is illustrated in the next
slide.
Copyright © Business Data Miners.
9
Rule 1. High Dollar Amount
Dollar Amount of Transactions in Last 7 Days
Predictive Model that uses
two variables simultaneously
Rule 2. High
Frequency
Lowering Threshold Increases False Positive
Number of Transactions in Last 7 Days
Copyright © Business Data Miners.
10
The Goal of the Project
• Build a statistical model to predict returned
items (RDI).
• The work is experimental and the final
model has not been implemented
operationally.
• The work was done when Dr. Jay Zhou worked
for Citizens Bank as a VP. For business
questions, please contact Josh Ablett at
[email protected].
Copyright © Business Data Miners.
11
Major Steps
1.
2.
3.
4.
5.
6.
Data Gathering
Data Validation
Data Preparation
Feature Variable Calculation
Predictive Model Building/Testing
Predictive Model Deployment
Copyright © Business Data Miners.
12
Data Used
– Account Data
• Account open date
• Account open branch
• Account type
– Check Deposit (54 millions)
• Deposit Date
• Deposit Amount
• Paying Bank Number
• Receiving Bank Number
• Check Serial Number
– Third Party “Known Fraud Account” Hits
• A service provider has a repository of known fraudulent checks. A hit is
generated when a check’s paying bank number is found in the known
fraudulent check list.
Copyright © Business Data Miners.
13
Data Challenges
•
•
•
•
Security: account numbers, SSN, etc.
Large Volume: 54 million check items
Refresh Data Daily
Various Data Sources
– Text Files
– SQL Server
– Oracle DB
Copyright © Business Data Miners.
14
RDBMS Solution
• Oracle DB
•
•
•
•
Access is controlled: security
External table: query or load text file
heterogeneous services: load SQL Server table
Materialized view : refresh data based on
predefined schedules
• Table Partition, Index: handle large data
Copyright © Business Data Miners.
15
2. Data Validation
• SQL functions
–Check the number of null values
–Minimum, Maximum, Median values
–Number of distinctive values
–Number of records by date
- Histogram: using width_bucket()‫‏‬
- And More
Copyright © Business Data Miners.
16
2. Data Validation (continued)‫‏‬
• Metadata table
• Store variable statistics.
• Easier to handle hundreds of variables.
• We can query it to get the variables with desired
characteristics.
• Unique keys:
• Variables with names containing the word
“credit” and percentage of missing values less
than 15%
Copyright © Business Data Miners.
17
3. Data Preparation
• Merge transaction data with performance data
– Most of the time, there are no common keys to
uniquely link transactions and performance data.
• Unmatched cases
• Duplicates
– Iterative SQL joining
• Try many iterations of matching methods, from
“stronger” ones to “weaker” ones. Successfully
matched cases are removed from the “pools”.
Copyright © Business Data Miners.
18
3. Data Preparation (continued)
• Sampling
– Randomly Sample the data based on depositing account
numbers.
• Creating Training and Testing Sets
– Randomly split data into training and testing sets. There is no
account number overlap between two sets.
Copyright © Business Data Miners.
19
4. Feature Variable Calculation
• Based on the original variables existing in the
data, we are able to create more salient
variables that capture interesting transaction
patterns.
Copyright © Business Data Miners.
20
4. Feature Variable Calculation (Continued)‫‏‬
• Risk Group Variables
– We calculate historical fraud rate for each account
type. Then we group account types into a small
number of risk groups based on their historical
fraud rates. The group numbers are used as one of
input variables to the model.
– Use SQL “group by” to calculate fraud ratio.
Copyright © Business Data Miners.
21
4. Feature Variable Calculation (Continued)‫‏‬
• Transaction Time Series Variable
• Life Time Maximum Check Amount From Paying Account
– max(AMOUNT) over (partition by ABA_ROUTING_NUMBER
,PAYING_ACCOUNT order by TRANSACTION_DATE rows
between unbounded preceding and current row)
life_time_max_amount
• Number of Checks Deposited in Last Two Days
– count(1) over (partition by ACCOUNT_NUMBER order by
TRANSACTION_DATE range between 1 preceding and current
row) num_2d
Copyright © Business Data Miners.
22
Calculate Feature Variables
(Continued)
• It is helpful to create as many feature variables
as possible.
• Then we can use statistical methods to select a
small number of variables that are most
relevant to the our target variable, i.e.,
returned item.
Copyright © Business Data Miners.
23
Data Preparation/Feature Variable
Calculation Considerations ‫‏‬
• Use views or materialized views when
possible.
– Challenge caused by separation of data and
processing logic
– With a view or materialized view, the logic that is used to
produce the view can be retrieved using the following
query:
• select text from user_views where view_name='DATA_VIEW';
Copyright © Business Data Miners.
24
5.Predictive Model Building/Testing
• We used logistic regression models in Oracle 11g
ODM.
• Variables used in the model including:
– Account Type Risk Group
– Days Since Account Opened
– Check Serial Number Out of Sequence
– Third Party “Negative File” Hit
– Number of Checks Deposited In Past Two Days
– Account Open Branch Risk
– Item Amount vs. Maximum Check Amount from the
Paying Bank Account
Copyright © Business Data Miners.
25
5.Predictive Model Building/Testing
(Continued)‫‏‬
• The final model is stored as an Oracle mining
object.
• Prediction_Probability function: remove the
traditional bottleneck of model deployment.
– select a.id, prediction_probability(model_object,'1'
using *) score from table_data a;
Copyright © Business Data Miners.
26
6. Model Deployment
• All of the steps are implemented as
materialized views.
• Score calculation for new data is simply
refreshing a series of materialized views.
Copyright © Business Data Miners.
27
7. Reporting
• The checks are rank ordered based on the risk
scores. Analysts can view the report using a
web browser.
Copyright © Business Data Miners.
28
Model Score Performance
–If we maintain the same fraud
prevention staff level, it is
forecasted that the total fraud
loss prevented will be up 20%.
Copyright © Business Data Miners.
29
Drawbacks of In-Database Modeling
• Data miners are not self-sufficient
• Need support from database
administrator
Copyright © Business Data Miners.
30
Summary
• Predictive scoring model
– Significantly improve the check deposit fraud
detection performance.
• From data miner perspective
– Powerful integration of SQL and advanced
modeling functions
• Single environment
– Almost all the processes executed within a
database environment
– Increased security, productivity, manageability and
scalability.
Copyright © Business Data Miners.
31
Questions?
Copyright © Business Data Miners.