Building In-Database Predictive Scoring Model: Check Fraud Detection Case Study
Jay Zhou, Ph.D.
Business Data Miners, LLC
978-726-3182
[email protected]
Web Site: www.businessdataminers.com
LinkedIn: www.linkedin.com/in/jiangzhou
Copyright © Business Data Miners.

• Four Components
  – Business: Check Fraud
  – Scoring Model: Logistic Regression
  – Modeling Tool: Oracle Data Mining
  – Data Storage/Management: Oracle Database

• Business Problems: Marketing, Fraud, LTV, Churn, Credit
• Data Mining/Predictive Analytics Methods: Tree, Logistic regression, Gradient Boosting, Regression, SVM, ANN, Clustering
• DM/PA Tools: R, SAS, S-PLUS, SPSS, StatSoft, ODM, Weka
• Data Storage/Management: Oracle, SQL Server, DB2, MySQL, SAS File, R File
Copyright © 2009 Business Data Miners.

• Business Problems: Marketing, Fraud, LTV, Churn, Credit
• Data Mining/Predictive Analytics Methods: Tree, Logistic regression, Regression, SVM, Anomaly, Clustering
• DM/PA Tools/Database: ODM, Oracle 11g

About Check Deposit
• When a check is deposited, two banks are involved:
  – Depositor’s bank and payer’s bank
  – Usually the depositor’s bank makes the funds available for the depositor to withdraw before the payer’s bank actually transfers the funds to the depositor’s bank.

About Check Deposit Fraud
• Fraudsters take advantage of this “gap”. They deposit fraudulent checks to inflate their account balances and then withdraw funds quickly. The depositor’s bank suffers losses when the fraudulent checks are returned unpaid by the payer’s bank.

About Rule Based Deposit Fraud Detection Systems
• Advantages
  – Understandable by analysts.
  – Make use of analysts’ experience and intuition.
• Examples
  – Rule 1: Number of check deposits in last 7 days more than 35
  – Rule 2: Total amount of check deposits in last 5 days higher than $15,000

Statistical Predictive Model
• Data driven.
• Built based on a large number of historical cases.
• Advantages
  – Handles many variables simultaneously and normally detects more fraud with fewer false positives. This is illustrated in the next slide.

[Figure: Dollar Amount of Transactions in Last 7 Days plotted against Number of Transactions in Last 7 Days. Labels: “Rule 1. High Dollar Amount”; “Rule 2. High Frequency”; “Lowering Threshold Increases False Positive”; “Predictive Model that uses two variables simultaneously”.]

The Goal of the Project
• Build a statistical model to predict returned items (RDI).
• The work is experimental and the final model has not been implemented operationally.
• The work was done when Dr. Jay Zhou worked for Citizens Bank as a VP. For business questions, please contact Josh Ablett at [email protected].

Major Steps
1. Data Gathering
2. Data Validation
3. Data Preparation
4. Feature Variable Calculation
5. Predictive Model Building/Testing
6. Predictive Model Deployment

Data Used
• Account Data
  – Account open date
  – Account open branch
  – Account type
• Check Deposit (54 million)
  – Deposit Date
  – Deposit Amount
  – Paying Bank Number
  – Receiving Bank Number
  – Check Serial Number
• Third Party “Known Fraud Account” Hits
  – A service provider has a repository of known fraudulent checks. A hit is generated when a check’s paying bank number is found in the known fraudulent check list.

Data Challenges
• Security: account numbers, SSN, etc.
• Large Volume: 54 million check items
• Refresh Data Daily
• Various Data Sources
  – Text Files
  – SQL Server
  – Oracle DB

RDBMS Solution
• Oracle DB
  – Access is controlled: security
  – External table: query or load text file
  – Heterogeneous services: load SQL Server table
  – Materialized view: refresh data based on predefined schedules
  – Table partition, index: handle large data

2. Data Validation
• SQL functions
  – Check the number of null values
  – Minimum, maximum, median values
  – Number of distinct values
  – Number of records by date
  – Histogram: using width_bucket()
  – And more

2. Data Validation (continued)
• Metadata table
  – Stores variable statistics.
  – Makes it easier to handle hundreds of variables.
  – We can query it to get the variables with desired characteristics, such as:
    • Unique keys
    • Variables with names containing the word “credit” and percentage of missing values less than 15%

3. Data Preparation
• Merge transaction data with performance data
  – Most of the time, there are no common keys to uniquely link transactions and performance data.
    • Unmatched cases
    • Duplicates
  – Iterative SQL joining
    • Try many iterations of matching methods, from “stronger” ones to “weaker” ones. Successfully matched cases are removed from the “pools”.

3. Data Preparation (continued)
• Sampling
  – Randomly sample the data based on depositing account numbers.
• Creating Training and Testing Sets
  – Randomly split data into training and testing sets. There is no account number overlap between the two sets.

4. Feature Variable Calculation
• Based on the original variables existing in the data, we are able to create more salient variables that capture interesting transaction patterns.

4. Feature Variable Calculation (continued)
• Risk Group Variables
  – We calculate the historical fraud rate for each account type. Then we group account types into a small number of risk groups based on their historical fraud rates. The group numbers are used as one of the input variables to the model.
  – Use SQL “group by” to calculate the fraud ratio.

4. Feature Variable Calculation (continued)
• Transaction Time Series Variable
  – Life Time Maximum Check Amount From Paying Account
      max(AMOUNT) over (partition by ABA_ROUTING_NUMBER, PAYING_ACCOUNT order by TRANSACTION_DATE rows between unbounded preceding and current row) life_time_max_amount
  – Number of Checks Deposited in Last Two Days
      count(1) over (partition by ACCOUNT_NUMBER order by TRANSACTION_DATE range between 1 preceding and current row) num_2d

Calculate Feature Variables (continued)
• It is helpful to create as many feature variables as possible.
• Then we can use statistical methods to select a small number of variables that are most relevant to our target variable, i.e., returned item.

Data Preparation/Feature Variable Calculation Considerations
• Use views or materialized views when possible.
  – This addresses the challenge caused by separation of data and processing logic.
  – With a view, the logic that is used to produce it can be retrieved using the following query (for a materialized view, the defining query is in the QUERY column of USER_MVIEWS):
      select text from user_views where view_name='DATA_VIEW';

5. Predictive Model Building/Testing
• We used logistic regression models in Oracle 11g ODM.
• Variables used in the model include:
  – Account Type Risk Group
  – Days Since Account Opened
  – Check Serial Number Out of Sequence
  – Third Party “Negative File” Hit
  – Number of Checks Deposited In Past Two Days
  – Account Open Branch Risk
  – Item Amount vs. Maximum Check Amount from the Paying Bank Account
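
Two of the model inputs listed above rely on the window functions from step 4. The query below is a minimal sketch of those two expressions on a handful of inline rows, so it can be run without any project tables. The sample rows and their values are hypothetical; only the column names and the two window expressions come from the slides. It assumes TRANSACTION_DATE carries no time-of-day component, in which case "range between 1 preceding" covers the current day and the day before.

```sql
-- Hypothetical sample data: two depositing accounts (D100, D200) and
-- two paying accounts (P1, P2) at one made-up routing number (R1).
with check_deposits as (
  select 'D100' account_number, 'R1' aba_routing_number, 'P1' paying_account,
         date '2009-03-01' transaction_date, 200 amount from dual union all
  select 'D100', 'R1', 'P1', date '2009-03-02',  900 from dual union all
  select 'D100', 'R1', 'P2', date '2009-03-02',  150 from dual union all
  select 'D100', 'R1', 'P1', date '2009-03-05',  400 from dual union all
  select 'D200', 'R1', 'P2', date '2009-03-05', 4000 from dual
)
select account_number,
       paying_account,
       transaction_date,
       amount,
       -- Running maximum per paying account, up to and including this check.
       max(amount) over (partition by aba_routing_number, paying_account
                         order by transaction_date
                         rows between unbounded preceding and current row)
         life_time_max_amount,
       -- Checks deposited into this account today or yesterday.
       count(1) over (partition by account_number
                      order by transaction_date
                      range between 1 preceding and current row)
         num_2d
from check_deposits
order by account_number, transaction_date, paying_account;
```

With these rows, both D100 deposits dated 2 March get num_2d = 3 (one check on 1 March plus two on 2 March), and the 400 check of 5 March drawn on P1 gets life_time_max_amount = 900. Note that "rows between" depends on row order, so if one paying account can have several checks on the same date, a tie-breaking column in the "order by" would be needed to make the running maximum deterministic.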

5. Predictive Model Building/Testing (continued)
• The final model is stored as an Oracle mining object.
• Prediction_Probability function: removes the traditional bottleneck of model deployment.
  – select a.id, prediction_probability(model_object,'1' using *) score from table_data a;

6. Model Deployment
• All of the steps are implemented as materialized views.
• Score calculation for new data is simply refreshing a series of materialized views.

7. Reporting
• The checks are rank ordered based on the risk scores. Analysts can view the report using a web browser.

Model Score Performance
• If we maintain the same fraud prevention staff level, the total fraud loss prevented is forecast to be up 20%.

Drawbacks of In-Database Modeling
• Data miners are not self-sufficient
• Need support from database administrator

Summary
• Predictive scoring model
  – Significantly improves the check deposit fraud detection performance.
• From data miner perspective
  – Powerful integration of SQL and advanced modeling functions
• Single environment
  – Almost all the processes are executed within a database environment
  – Increased security, productivity, manageability and scalability.

Questions?
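
As a sketch of the deployment idea in step 6, the script below builds one feature step as a materialized view and recomputes it when new deposits arrive. All object names (check_deposits_demo, check_features_demo_mv) and the sample rows are hypothetical, and the script assumes a schema with the privileges to create tables and materialized views. The scoring step itself is left out because it needs a trained mining model; in the setup described on the slides it would be a further materialized view selecting prediction_probability over the feature view.

```sql
-- Hypothetical demo objects; none of these names come from the original project.
create table check_deposits_demo (
  id               number primary key,
  account_number   varchar2(10),
  transaction_date date,
  amount           number
);

insert into check_deposits_demo values (1, 'D100', date '2009-03-01', 200);
insert into check_deposits_demo values (2, 'D100', date '2009-03-02', 900);
commit;

-- One feature step stored as a materialized view, refreshed only on request.
create materialized view check_features_demo_mv
  build immediate
  refresh complete on demand
as
select id, account_number, transaction_date, amount,
       count(1) over (partition by account_number
                      order by transaction_date
                      range between 1 preceding and current row) num_2d
from check_deposits_demo;

-- New data arrives after the view was built.
insert into check_deposits_demo values (3, 'D100', date '2009-03-03', 4000);
commit;

-- Recompute the step; 'C' requests a complete refresh.
begin
  dbms_mview.refresh(list => 'CHECK_FEATURES_DEMO_MV', method => 'C');
end;
/

select id, num_2d from check_features_demo_mv order by id;
```

The row for check 3 appears in the view only after the refresh call. A chain of such views would be refreshed in dependency order, features first and scores last.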