Data Mining In A Zero Latency Enterprise
May 30, 2001
Philip Bosinoff, Compaq Advanced Technology Center

Agenda
- ZLE Introduction
- Compaq ZLE Initiative
- Asset Protection Data Mining Application
- Summary and Conclusions

Zero Latency Enterprise
A definition: "Zero Latency is the real-time, enterprise-wide dissemination of new information distributed in such a way that allows businesses to react quickly to it, driving the competitive business advantage to its ultimate limits" (Paul Larson, Talarian Corporation)
The goal: "Instantaneous awareness and appropriate response to events across an entire enterprise" (Roy Schulte, Vice President, World Services, Gartner Group)

ZLE Retail CRM Example
Making customer touchpoints integrated, responsive, and personal.
[Diagram: customer data from credit, POS, gift registry, refunds, eCommerce, and call center systems combined into a single view of the customer, in real time.]

ZLE Solution Challenges
- Synchronize information across various enterprise applications
- Integrate data for the enterprise
- Provide data to Business Intelligence applications for analysis
- Support operational applications needing access to up-to-the-minute data from multiple systems
[Diagram: registration, property, F&B, ticketing, and eCommerce systems connected through a ZLE hub.]

The Compaq ZLE Program
Business initiative to deliver complete solutions.
- Partners: SAS, Mercator, Trillium, Acxiom, Actional, Blaze, MicroStrategy, Protagona, Savant; Cap Gemini, Lockheed, Deloitte, EDS, KPMG, and others
- Initial industry targets: Retail (CRM), Telecommunications, Finance

Compaq ZLE Architecture
[Diagram: enterprise applications connect through adapters to the Compaq ZLE core services; clip-on applications, analysis components, and docking ISVs attach to the core, which contains the operational data store, data models, and data mining and analysis.]

ZLE Core Services
- Applications and clip-on applications
- Rules, transform, workflow, router/inserter, ad hoc query access, ETL
- Application server / TP monitor (CORBA, Tuxedo, Java frameworks)
- Cluster-aware DBMS
- SSI clustered operating system

Compaq ZLE Clip-on Services
Services make use of the real-time data and application integration capabilities. Examples:
- Interaction Manager: personalization; suggests appropriate customer responses
- Guest/Customer Manager: profiling and data enrichment; synchronizes info across systems; enables a unified customer view
- Campaign Management
- Narrowcasting: notification
[Diagram: campaign management, Interaction Manager, Guest Manager, and notification services attached to the ZLE core services and ODS, with detailed access.]

Proof Point: An Industry Milestone
[Diagram: a telecommunications deployment. A Compaq NonStop™ Himalaya™ system (128 processors, 256 GB memory, 110 TB) serves as the ODS, holding 100 billion call detail records (CDRs) on a rolling 90-day basis, loading 1.2 billion CDRs per day, and sustaining 1,000 queries/second. A Compaq AlphaServer™ (8 processors, 1 TB) hosts the data mining warehouse, fed by batch extract, with model deployment back to the ODS and on-line/batch data synchronization. ProLiant™ systems (4 processors each, plus a 2 x 2 cluster with 768 GB running MicroStrategy software for jumbo parallel queries) support the credit warehouse, rating engines, government compliance, status monitors, and customer service.]

Role of Data Mining in ZLE
- Determine most effective responses to business events
- ZLE architecture facilitates mining by providing a rich, integrated, current data source and integrated operational systems and business processes

Background
- ZLE asset protection study conducted: SAS and Compaq analytical consultants; large click-and-mortar US retailer (> $30B); sensitive topic, confidential results
- Credit-card fraud application described: representative of actual study; demonstrated at Compaq booth
- The opportunity: use SAS Enterprise Miner to build predictive models; identify fraudulent private-label card purchases in real time; potentially large ROI (e.g., 0.1% = $30M)

Credit Card Fraud Methods
- Cards stolen or numbers generated
- Cards used quickly for multiple purchases at multiple stores
- Typical items purchased: consumer electronics, jewelry, videos and CDs
- High purchase amount, high minimum item price
- Suspicious refund activity

Source Data For Modeling
- Fraud rate of 0.3% (0.25%-2% is typical)
- Stratified sampling used: all fraud cases; random sample of non-fraud cases
- Case set variables: current purchase information; historical purchase/refund measures; account parameters; fraud flag (target)

Modeling in Enterprise Miner
- Nodes are linked into a process flow diagram
- Processing steps: input data set defined; data partitioned into train, validate, and test sets; tree and neural network models built; model performance assessed

Tree Node Output
Example rule:
IF CardsIssued30days
  AND StorePurch7Days > 3
  AND StoresVisited1Day > 2
  AND ElectronicsPurch3Day > 2
  AND PurchAmt > $75
THEN probability of fraud is High

Measuring Model Effectiveness
Confusion matrix (rows = actual, columns = predicted):

                 Predicted
  Actual         Not Fraud   Fraud
  Not Fraud          a         b
  Fraud              c         d

- Precision = d/(b+d): how many of our predicted frauds are actual?
- Recall = d/(c+d): how many of the actual frauds do we catch?
- Specificity = a/(a+b)
- False positive rate = b/(a+b)
- False negative rate = c/(c+d)
- Accuracy = (a+d)/(a+b+c+d)

Enterprise Miner Threshold-Based Charts
- Easy way to visualize confusion matrices
- Classification threshold can be set interactively
- Generated report shows confusion matrices for a set of thresholds

Setting The Classification Threshold
Recall can be traded off for precision.

Threshold = 40 (precision = .50, recall = .32):
                 Predicted
  Actual         No Fraud   Fraud
  No Fraud        18907       832
  Fraud            1770       840

Threshold = 45 (precision = .63, recall = .12):
                 Predicted
  Actual         No Fraud   Fraud
  No Fraud        19558       181
  Fraud            2300       310

Less fraud caught, but fewer false positives.

Model Deployment
- Threshold(s) and business processes determined
- Tree model converted to rules (manual now, automated support coming)
- Rules executed in the ZLE Interaction Manager: in real time; using integrated, detailed, current data; may be mixed with other business rules; performance monitored

Summary and Conclusions
- A ZLE system enables rapid dissemination and integration of information, and real-time responses to business events
- Data mining used to determine effective responses
- ZLE and data mining are very synergistic
- ZLE solutions available from Compaq, SAS and partners

Data Mining in a Zero Latency Enterprise
Philip Bosinoff and Michael Heytens
Compaq Computer Corp.

Abstract

Customers expect companies to provide current and complete information around the clock, and interactions to be personalized, whether face-to-face, over the phone or on the Internet. A Zero Latency Enterprise (ZLE) solution from Compaq and several partners directly addresses this challenge by enabling the instantaneous dissemination of new information across an enterprise, and using this information to respond to business events in real time in an effective and customized manner. In this paper we discuss the architecture of this ZLE solution. Then we describe in detail a ZLE data mining application using SAS Enterprise Miner to detect retail credit card fraud.
This application is based on a fraud detection study done with a large U.S. retailer (greater than US$30 billion in revenue) and SAS analytical consultants. The current, comprehensive customer information available in a ZLE environment enables effective models to be built quickly in Enterprise Miner. The ZLE environment allows these models to be deployed easily into a business rules engine and executed against up-to-the-second information to detect fraudulent credit card purchases in real time. Data mining, done in the context of a ZLE solution, enables companies to respond quickly and effectively to business events.

1. Introduction

The Internet and e-commerce have transformed marketplace expectations. People want accurate, up-to-the-second information and appropriate, instantaneous responses. Yet they also expect the same kind of personalized attention they get from face-to-face interactions. Companies doing business on the Internet must bring the entire enterprise up to Internet speed and find a way to personalize the e-commerce experience. This must be done in an increasingly competitive business environment, where the ability to react quickly to fast-changing market conditions while minimizing development and operations costs is required to compete effectively.

To meet these challenges, businesses must become what the Gartner Group has called a Zero Latency Enterprise (ZLE): an organization that is able to make new information available instantaneously across the enterprise and to use it to respond to business events in real time, in an appropriate and customized way. Zero Latency Enterprise solutions from Compaq and its partners directly address these challenges. The Compaq ZLE solution architecture provides these capabilities by integrating applications and data through a combination of EAI (Enterprise Application Integration) and ODS (Operational Data Store) technologies.
We describe this architecture, and the motivating principles behind it, in the next section of the paper. The ZLE architecture has been applied successfully in a variety of industries, such as telecommunications, finance, and retail. In Section 3, we describe an implementation of this architecture for a retail environment, developed in collaboration with a large U.S. retailer (greater than US$30 billion in annual revenue). This implementation allows the retailer to have a single, current, integrated, enterprise-wide view of the customer, and to use this view to respond to customer interactions in real time in an effective and personalized way.

Many companies have partnered with Compaq in the development and delivery of ZLE solutions. One such partner is SAS. ZLE solutions utilize SAS Enterprise Miner for data mining, and other SAS products and services in various ways. SAS Enterprise Miner, and data mining in general, play a critical role in ZLE solutions by helping to identify and understand the responses to business events that are most appropriate and likely to be most effective. A ZLE environment, in turn, greatly facilitates data mining by providing a rich, integrated data source, and a platform through which mining results, such as predictive models, can be deployed quickly and flexibly.

There are many possible applications of data mining in a ZLE solution: personalizing offers at the e-store and other touchpoints; asset protection; campaign management; and real-time risk assessment. In Section 4, we describe one such application in detail: real-time credit card fraud detection in a retail environment. This application was developed in partnership with asset protection professionals from the large U.S. retailer mentioned previously. The paper ends in Section 5 with a summary and conclusions.

2. Compaq ZLE Solutions

Traditional approaches to application and data integration

There are two key capabilities inherent in a ZLE information system that pose significant technical challenges. The first is integrating and instantaneously disseminating new information across an enterprise. Doing this successfully requires enterprise application integration (EAI), which is typically accomplished by solutions that "push" information from a source application to various targets in response to a business event. For example, if an order request were submitted to a Web-based order management system, EAI could be used to forward the request to a backend fulfillment system. To enable this kind of integration, an EAI solution must include adapters that allow various applications to plug into the EAI communications infrastructure, business rules for specifying the logic associated with application-to-application interactions, and workflow management for managing the flow of interactions across application systems. Data transformation technologies are also required at various points in an EAI solution for mapping information from the structure and semantics of one application to another.

The second key ZLE capability is the ability to use the disseminated and integrated information to respond in real time to business events. This capability has traditionally been implemented by caching data from across the enterprise in a repository such as an operational data store (ODS), from which applications and users can then extract information to meet business needs. An ODS is fed in near real time by other databases and applications in the enterprise, and it contains summarized data, like a data warehouse, as well as the more detailed transactional data generated by operational systems.
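The push-style integration described above can be sketched as a minimal publish/subscribe router. This is an illustrative sketch only, not the NonStop solutions integrator's actual API; the event names, payloads, and handlers are hypothetical.

```python
from collections import defaultdict

class EventRouter:
    """Minimal EAI-style router: source applications publish business
    events, and adapters for target applications subscribe to them."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        # Register a target-application adapter for an event type.
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Push the event to every registered target, in subscription order.
        for handler in self._subscribers[event_type]:
            handler(payload)

# Hypothetical example: a web order is pushed both to a backend
# fulfillment system and into an ODS-like cache.
router = EventRouter()
fulfillment_queue, ods_cache = [], {}
router.subscribe("order.created", fulfillment_queue.append)
router.subscribe("order.created", lambda o: ods_cache.update({o["order_id"]: o}))

router.publish("order.created", {"order_id": 42, "sku": "TV-100", "qty": 1})
```

In a real EAI product the adapters would also handle data transformation between the source and target schemas; here both targets happen to accept the same payload.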
Limitations of traditional integration approaches

Both EAI and ODS technologies have been applied in the past in isolation to provide application and data integration, respectively. While traditional uses of these technologies have certainly been beneficial, they fall far short of meeting the ZLE challenge described earlier. EAI technologies, for example, aren't designed to be available 24-by-7, or to handle high-volume event environments, such as the hundreds or even thousands of events per second in retail point-of-sale (POS) and e-store clickstream applications. Also, EAI solutions typically provide very inefficient mechanisms for retrieving information from across an enterprise, a key required capability in a ZLE solution. Finally, EAI by itself has no persistent long-term storage, another required ZLE capability.

An ODS solution also falls short of meeting the ZLE challenge, for several reasons. First, it provides only data integration and does not address the application integration issue at all. Second, once written to the ODS, data is typically not updateable, an important capability for maintaining an integrated and consistent view of the enterprise. Finally, while an ODS is more operationally focused than, say, a data warehouse, the data in an ODS is usually not detailed enough to provide actual operational support for many enterprise applications.

ZLE solution architecture

The architectural approach taken in Compaq ZLE solutions is to combine EAI and ODS technologies, retaining the benefits of each, and using the two in combination to address the shortcomings discussed above. This approach is shown in Figure 1 below.

Figure 1. Compaq Zero Latency Enterprise Solutions combine ODS and EAI technologies, with the NonStop™ solutions integrator at the hub.
The EAI layer, in the form of the NonStop™ solutions integrator, includes adapters that support a variety of application-to-application communications schemes, including messages, transactions, objects, and database access. The ODS layer contains a cache of data from across the enterprise, which is updated directly and in near real-time by application systems, or indirectly through the EAI layer.

In addition to the EAI and ODS hub of a Compaq ZLE solution shown in Figure 1 above, there are several other key architectural components. One is a set of analysis marts, for doing data mining and other forms of business intelligence, such as OLAP. These marts are fed data from the ODS, and the results of any analysis performed in these marts are deployed back into the ZLE hub for use in operational systems.

Another important component, for customer-focused ZLE solutions, is the Customer Manager. This component is responsible for maintaining a single, enriched and enterprise-wide view of the customer. The tasks performed by the Customer Manager include: de-duplication of customer information (e.g., recognizing duplicate customer information resulting from minor spelling differences), propagating changes to customer information to the ODS and all affected applications, and enriching in-house data with third-party demographics, psychographics and other kinds of information.

Another architectural component for customer-focused ZLE solutions is the Interaction Manager. This component is responsible for recommending appropriate responses to customer interactions, based on the information maintained by the Customer Manager and other information in the ODS. The Interaction Manager is often the vehicle through which data mining results, such as predictive models, are deployed. The Interaction Manager is designed to support a flexible and expressive set of business rules.
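The de-duplication task performed by the Customer Manager can be sketched with simple string similarity. This is a crude stand-in for illustration only; the actual matching done by products such as Trillium or AbiliTec is far more sophisticated, and the names and threshold below are hypothetical.

```python
import difflib

def dedupe_customers(records, threshold=0.85):
    """Collapse customer records whose names are near-duplicates
    (e.g., minor spelling differences across source systems)."""
    canonical = []
    for rec in records:
        for kept in canonical:
            ratio = difflib.SequenceMatcher(
                None, rec["name"].lower(), kept["name"].lower()).ratio()
            if ratio >= threshold:
                # Treat as the same customer; remember the extra source.
                kept["sources"].append(rec["source"])
                break
        else:
            canonical.append({"name": rec["name"], "sources": [rec["source"]]})
    return canonical

# Hypothetical records for one customer seen under two spellings.
merged = dedupe_customers([
    {"name": "Jon Smith",  "source": "POS"},
    {"name": "John Smith", "source": "e-store"},
    {"name": "Mary Jones", "source": "call center"},
])
```

A production Customer Manager would also match on address, phone, account numbers, and household keys rather than name similarity alone.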
Data mining in a ZLE environment

Data mining techniques and the ZLE solution architecture described above are very synergistic, in the sense that data mining plays a key role in the overall solution, and the ZLE solution infrastructure, in turn, greatly facilitates data mining. Data mining's role is to help identify and understand the most effective ways to respond to business events, based on historical data. For example, an e-store clickstream can be analyzed to identify the factors (navigation, previous purchase patterns, etc.) associated with visitors who tend to buy certain kinds of products. These factors can then be used in the Interaction Manager to determine the most appropriate ads, offers and content to display to future e-store visitors. In this manner, data mining directly supports the essence of a ZLE: responding in an appropriate and customized way to business events, based on integrated and current data from across an enterprise.

The Compaq ZLE solution architecture greatly facilitates data mining by (1) performing much of the data preparation work for mining and (2) integrating many of the business processes and operational systems required for the effective deployment of mining results. For example, deploying a model that predicts whether or not a customer will respond to an e-store offer may require gathering customer attributes such as demographics, purchase history, browse history and so on from a variety of systems. In a ZLE environment, this task is greatly simplified, because all this information is in the ODS in an integrated and current form.

3. A ZLE Solution for Retail CRM

In the previous section, we saw the Compaq ZLE solution architecture. In this section, we discuss the application of a ZLE solution to customer relationship management (CRM) in the retail industry, then describe an actual implementation of this architecture developed in partnership with a large U.S. retailer.
New challenges for retailers

Neighborhood store owners know their customers. When customers visit the store, the owner suggests products likely to appeal. This kind of personalized service results in great customer loyalty, a cornerstone of every retailer's success. On the Internet and in today's large retail chains, though, maintaining customer loyalty through personalized service is much more challenging. In these environments, building a deep understanding of customer preferences and needs is difficult, because the interactions that provide this information are scattered across separate systems for sales, marketing, service, returns, credit cards, and so on. Also, customers have many choices and can easily shop elsewhere.

To keep customers coming back, retailers of all sizes need to find a way to recapture the personal touch. They need comprehensive knowledge of the customer that encompasses the customer's entire relationship with the retailer. Equally important is the ability to act on that customer knowledge instantaneously, for example by making personalized offers during every customer interaction, no matter how brief.

Obtaining a complete customer view

A key element of interacting with customers in a personalized way is having available a single, complete, current, enterprise-wide view of the customer. This kind of customer view is simply not present in most retail environments today. Retailers typically have a very fragmented view of customers, resulting from the separate and often incompatible computer systems for gift registry, credit card, returns, POS, e-store, and so on. So, for example, if a customer attempts to return an item a few days after the return period, the refund-desk representative who handles the return likely has no way of knowing if the customer is loyal and profitable, thus warranting some leniency.
Similarly, if a customer has just purchased an item, the marketing department is not made aware that the customer should not be sent discount offers for that item in the future. A ZLE solution integrates all customer information from all channels, enabling retailers to make effective, personalized offers at every customer touchpoint: the brick-and-mortar store, call center, or online e-store. For example, an e-store customer who has just purchased gardening supplies at the brick-and-mortar store can be offered complementary outdoor products the next time she visits the website.

A ZLE retail implementation

The retail ZLE implementation developed by Compaq, in partnership with a large retail customer and various partner companies, consists of a framework with many components. These components can be assembled, based on customer requirements and preferences, into a retail ZLE solution. The major components in this solution framework are listed in Table 1 below.

Function: Component
- Operational data store: NonStop™ Himalaya™ servers with NonStop™ SQL database, or AlphaServer systems with Oracle 8i™ database
- Application integration: Mercator Business Broker or Compaq BusinessBus software
- Integration of disparate databases and systems: Common Object Request Broker Architecture (CORBA) via NonStop™ DOM/MP and IBM MQSeries software
- Integration of external demographic data: Acxiom InfoBase software
- Customer consolidation and de-duplication: Harte-Hanks Trillium or Acxiom AbiliTec software
- Business rules definition and execution: Blaze Advisor Solutions Suite software
- Call center: Siebel Call Center software
- Customer campaign management: RSI Protagona software
- Data mining: SAS Enterprise Miner software on Compaq AlphaServer™ systems
- Broadcast of alerts: MicroStrategy Broadcaster software
- E-store front end: Microsoft Site Server Commerce Edition software and Fast Start services from Compaq

Table 1. Functions and components for a retail ZLE solution.
Several elements of the ZLE solution architecture were described in the previous section. Let's look at each of these elements and how they relate to the functions and components listed in Table 1.

The core ODS and EAI architectural components are implemented by a Compaq Himalaya™ server with the NonStop™ SQL database or an AlphaServer system with Oracle 8i™ (ODS), and by Mercator's Business Broker or Compaq's BusinessBus (EAI). Additional integration is achieved through the use of CORBA technology and IBM's MQSeries software.

The Customer Manager, which maintains in the ODS a complete and current customer view, utilizes Acxiom's InfoBase software to enrich internal customer information with demographics and psychographics. Consolidation and de-duplication of customer data is achieved via either Harte-Hanks's Trillium or Acxiom's AbiliTec software.

The Interaction Manager uses the Blaze Advisor Solutions Suite software, which includes a Java-based rules engine, for the definition and execution of business rules. The Interaction Manager suggests appropriate responses to e-store visitor clicks, calls to the call center, point-of-sale purchases, refunds, and a variety of other interactions across a retail enterprise.

Data mining analysis is performed via SAS Enterprise Miner running on a Compaq AlphaServer™ system. Source data for mining analysis is extracted from the ODS and moved to the mining platform. The results of any mining analysis, such as predictive models, are deployed into the rules engine inside the Interaction Manager or directly into ZLE applications. The ability to mix patterns discovered by sophisticated mining analyses with business rules-of-thumb, policies, etc. inside the Interaction Manager is very powerful.
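The idea of mixing mined patterns with hand-written policies can be sketched as follows. This is not Blaze Advisor rule syntax; it is a plain-Python stand-in, and the field names and thresholds are illustrative, not taken from the study.

```python
# A minimal sketch of an Interaction Manager-style rule set, in which a
# rule derived from a mined tree model sits alongside a business
# rule-of-thumb. All names and limits here are hypothetical.

def mined_fraud_rule(case):
    # The kind of pattern a decision tree might surface.
    return (case["cards_reissued_30d"] > 0
            and case["store_purchases_7d"] > 3
            and case["refund_stores_1d"] > 2)

def policy_rule(case):
    # Hand-written policy: flag any single-day spend over a fixed limit.
    return case["amount_today"] > 5000

RULES = [("mined_tree_rule", mined_fraud_rule), ("spend_policy", policy_rule)]

def evaluate(case):
    """Return the names of all rules that fire for this interaction."""
    return [name for name, rule in RULES if rule(case)]

flagged = evaluate({"cards_reissued_30d": 1, "store_purchases_7d": 5,
                    "refund_stores_1d": 3, "amount_today": 120})
```

Because both kinds of rule share one evaluation path, a model refresh is just a redefinition of `mined_fraud_rule`, leaving the policies untouched.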
There are many potential applications of data mining in a ZLE retail environment: e-store cross-sell and up-sell; real-time fraud detection, both in physical stores and e-stores; campaign management; and making personalized offers at all touchpoints. In the next section, we will take an in-depth look at one of these applications, real-time fraud detection.

4. An Asset Protection Data Mining Application

Compaq and SAS consultants conducted an in-depth study with a large U.S. retailer of how to apply data mining technology to the problem of detecting check fraud. Due to the sensitive nature of the check-fraud study, the information that we can reveal about it is very limited. However, we have created a ZLE credit card fraud demonstration, based on the check fraud study, that does not disclose any confidential information. We describe this credit card fraud demonstration in this paper.

The Opportunity

Data mining techniques provide an opportunity to detect fraud in the use of company-issued credit cards, fraud that would otherwise go undetected at the time of the infraction. A strong business case exists for adding ZLE-based data mining to a retailer's asset protection program. Even though typical retail credit card fraud rates are relatively small, in the 0.25% to 2% range, for a large retailer even a small reduction in fraud translates to millions of dollars saved per year.

We expect that most modern retailers already use some type of empirically derived rules, or even predictive mining models, as part of their asset protection program. In either case, predictions are probably made based on a very narrow customer view. The ZLE advantage is that models trained on current and comprehensive customer information can utilize up-to-the-second information to make real-time predictions. Note that in this paper we are discussing credit cards owned by the retailer, i.e., in-house credit cards, not cards issued by a third party or bank.
The card itself is branded with the retailer's name. In this case, the retailer has payment history and purchase history information for the consumer, an ideal situation for data mining with ZLE.

Source Data

As discussed above, in Sections 2 and 3, all source data is contained in the ODS. As such, much of the data preparation phase of standard data mining has already been accomplished. The cleaned, disparately sourced, de-duplicated, demographically enriched data is ready to mine.

Successful data mining for fraud detection requires the creation of a case set with carefully chosen variables, including derived variables. Note that we use the term variable to mean the same as attribute, column, or field. Each row in the demo data set describes the status of one credit card account. Each row can be thought of as a case, and the goal of the data mining exercise is to find patterns that differentiate the fraud and non-fraud cases. The demo data set is referred to as a case set.

Credit card fraud rates are typically in the range of about 0.25% to 2%. For model building, it is important to boost the percentage of fraud in the case set so that the ratio of fraud to non-fraud cases is higher, to as much as 50%. The reason is that if there are relatively few cases of fraud in the training set, the model-building algorithms will have difficulty finding patterns in the data.

The data set used in the eCRM ZLE demonstration contains approximately 1 million sample records, with each record describing the purchase activity of a customer on a company credit card. For the purposes of this paper, each row in the case set represents aggregate customer account activity over some reasonable time period, such that it makes sense for the account to be classified as fraudulent or non-fraudulent. This was done for convenience, given the customer-centric view used in the ZLE demonstration.
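The oversampling described above, keeping every fraud case and only a sample of non-fraud cases, can be sketched directly. The data here is synthetic and the 7:1 non-fraud-to-fraud ratio is chosen to match the ratio reported for the demonstration case set; the function and field names are illustrative.

```python
import random

def build_case_set(cases, nonfraud_per_fraud=7, seed=0):
    """Keep every fraud case plus a random sample of non-fraud cases,
    boosting the fraud percentage for model building."""
    fraud = [c for c in cases if c["fraud"]]
    nonfraud = [c for c in cases if not c["fraud"]]
    rng = random.Random(seed)
    sampled = rng.sample(nonfraud, min(len(nonfraud),
                                       nonfraud_per_fraud * len(fraud)))
    return fraud + sampled

# Synthetic population with a 0.3% fraud rate, as in the demonstration.
population = [{"id": i, "fraud": i % 1000 < 3} for i in range(100_000)]
case_set = build_case_set(population)
sample_fraud_rate = sum(c["fraud"] for c in case_set) / len(case_set)
```

With a 7:1 ratio the sampled case set has a 12.5% fraud rate, far above the 0.3% population rate; the discrepancy is what the prior-probability adjustment discussed below corrects for.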
Real-world case sets would more typically have one row per transaction, with each row identified as a fraudulent or non-fraudulent transaction. The number of fraud cases, or records, is approximately 125K, which translates to a fraudulent account rate of about 0.3% (125K out of the 40M guests in the complete eCRM demo database). Note how low this rate is: much less than 1%. All 125K fraud cases (i.e., customers for which credit-card fraud occurred) are in the case set, along with a sample of approximately 875K non-fraud cases. Both the true fraud rate (0.3%) and the ratio of non-fraud to fraud cases (roughly 7 to 1) in the case set are typical of what is found in real fraud detection studies. The demo data set is a synthetic one, in which we planted several patterns (described in detail below) associated with fraudulent credit card purchases.

We accounted for the difference between the true population fraud rate of 0.3% and the sample fraud rate of 12.5% by using the prior probability feature of Enterprise Miner, a feature expressly designed for this purpose. Enterprise Miner (EM) allows the user to set the true population probability of the rare target event. EM then automatically takes this into consideration in all model assessment calculations. This is discussed in more detail below in the model deployment section of the paper.

The demonstration case set contained the following fields:
- RAC30: number of cards reissued in the last 30 days.
- TSPUR7: total number of store purchases in the last 7 days.
- TSRFN3: total number of store refunds in the last 3 days.
- TSRFNV1: total number of different stores visited for refunds in the last 1 day.
- TSPUR3: total number of store purchases in the last 3 days.
- NSPD83: normalized measure of store purchases in department 8 (electronics) over the last 3 days.
  This variable is normalized in the sense that it is the number of purchases in department 8 in the last 3 days, divided by the number of purchases in the same department over the last 60 days.
- TSAMT7: total dollar amount spent in stores in the last 7 days.
- FRAUDFLAG: target variable.

The first seven are independent variables (i.e., the information that will be used to make a fraud prediction), and the eighth is the dependent or target variable (i.e., the outcome being predicted). Note that building the case set requires access to detailed, transaction-level data (e.g., to determine NSPD83) and data from multiple customer touchpoints (RAC30 would normally be stored in a credit card system, while variables such as TSPUR7, which describe in-store POS activity, would be stored in a different system). Also, the case set includes data up to the current day. The task of building an up-to-date data set from multiple systems is facilitated greatly in a ZLE environment, but is much more difficult in other environments.

Note that RAC30, TSPUR7, TSRFN3, TSRFNV1, TSPUR3, NSPD83, and TSAMT7 are "derived" variables. The ODS does not carry this information in exactly this form; these values were calculated from other existing fields. An appropriate set of SQL queries is one common way to create the case set.

Credit Card Fraud Methods

According to the asset protection professionals who participated in the study, one technique used to commit fraud begins by stealing a newly issued credit card. For example, a store may send out a new card to a customer, and a thief may steal it out of the customer's mailbox. The data set contains a variable that describes whether or not cards have been reissued recently (RAC30). Once a thief gets a stolen credit card, he or she typically uses it frequently in a short period of time, e.g., in 1-7 days, before the stolen card is reported and stops being accepted.
The data set contains variables that describe the total number of store purchases over the last 3 and 7 days, and the total amount spent in the last 7 days. Credit card thieves also tend to buy small, expensive items, like consumer electronics. These are evidently desirable either for personal use by the thief, or because they are easy to sell "on the street". The variable NSPD83 is a measure of the history of electronics purchases. Finally, thieves sometimes return merchandise bought with a stolen credit card for a cash refund. One technique for doing this is to use a fraudulent check to create a positive balance on a credit card, then buy and return items. Because there is a positive balance on the card used to purchase the goods, a cash refund is issued. (It seems like a questionable business practice to refund cash for something bought on a credit card, but evidently some stores do this!) Thieves often refund merchandise at different stores in the same city, to lower the chance of being caught. The data set contains several measures of refund activity.

To summarize, the purchase patterns associated with a stolen credit card are: many purchases in a short period of time, a high total dollar amount, recently reissued cards, purchases of electronics, suspicious refund activity, and so on. These are some of the patterns that the models built in the demonstration will detect.

Modeling

SAS Enterprise Miner supports a visual programming model, in which nodes representing various processing steps are connected together into process flows. The demonstration process flow diagram contains the following nodes:

Figure 2. Demonstration process flow diagram.

Note that Enterprise Miner version 3 was used in this demonstration. Version 4 has since been released, but the principles are the same. The goal of the analysis is to build a model that predicts credit card fraud.
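The study built its models with Enterprise Miner's Tree node. As a rough open-source analogue (not the tool used in the study), a decision tree classifier can be fit to a case set of this shape; the tiny case set below is invented purely for illustration.

```python
# Rough analogue of the Enterprise Miner Tree node using scikit-learn.
# This is NOT the tool used in the study; the tiny case set is invented.
from sklearn.tree import DecisionTreeClassifier

# Columns: RAC30, TSPUR7, TSRFNV1, NSPD83, TSAMT7
X = [
    [1, 6, 2, 3, 900.0],   # fraud-like: reissued card, burst of purchases
    [1, 5, 3, 1, 800.0],   # fraud-like: high 7-day dollar amount
    [0, 1, 0, 0,  40.0],   # normal shopper
    [0, 2, 1, 0,  60.0],   # normal shopper
]
y = [1, 1, 0, 0]           # FRAUDFLAG target variable

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
```

A real case set would of course hold the ~1M sampled records, and model assessment would use a holdout set rather than training data.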
The Enterprise Miner interface allows for quick model creation and easy comparison of model performance. Here is an example of the lower right portion of the output from the Tree node:

Figure 3. Tree node output.

The various paths through the tree, and the IF-THEN rules associated with them, describe the patterns associated with credit card fraud. One interesting path through the tree is:

If Cards reissued in last 30 days
and Total store purchases in last 7 days > 1
and Number of different stores visited for refunds in current day > 1
and Normalized number of purchases in electronics dept. in last 3 days > 2
Then Probability of fraud is HIGH

As described above, the conditions in this rule identify some of the telltale signs of credit card fraud, resulting in a prediction of fraud with high probability. The leaf node corresponding to this rule has a high concentration of fraud (approximately 80% fraud cases, 20% non-fraud) in the training and validation sets. (The first column of numbers shown on this and other nodes in the tree describes the training set, and the second column the validation set.) Note that the leaf nodes are color coded, with red nodes containing relatively little fraud, and green nodes the most. Another rule in the tree is:

If Cards reissued in last 30 days
and Total store purchases in last 7 days > 1
and Number of different stores visited for refunds in current day > 1
and Normalized number of purchases in electronics dept. in last 3 days <= 2
and Total amount of store purchases in last 7 days >= 700
Then Probability of fraud is HIGH

This is similar to the previous rule, except that fewer electronics items are purchased, but the total dollar amount purchased in the last 7 days is relatively large (at least $700). An alternative data mining model, produced by the Neural Network node in Enterprise Miner, gives very comparable results.
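Because decision tree paths are already IF-THEN rules, they translate directly into deployable code. The sketch below encodes the two paths quoted above; the HIGH/LOW labels stand in for the leaf probabilities, and treating "Cards reissued in last 30 days" as RAC30 >= 1 is an assumption.

```python
# Sketch: the two published tree paths as an explicit scoring function.
# Field names follow the case set; RAC30 >= 1 as the test for "cards
# reissued in last 30 days" is an assumption, and HIGH/LOW stand in for
# the actual leaf probabilities, which are not reproduced in the paper.
def tree_rule_fraud(rac30, tspur7, tsrfnv1, nspd83, tsamt7):
    if rac30 >= 1 and tspur7 > 1 and tsrfnv1 > 1:
        if nspd83 > 2:
            return "HIGH"   # first rule: heavy recent electronics buying
        if tsamt7 >= 700:
            return "HIGH"   # second rule: large 7-day dollar amount
    return "LOW"
```

This transparency is exactly the property contrasted with neural networks in the next section.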
In fact, the relative performance of these two classic data mining tools was very similar, even though the approaches are completely different. It is possible that tweaking the parameters of the neural network model would have given us a more powerful tool for fraud prediction, but this was not done during this study. In the decision tree model, the prediction logic is apparent and easily understood, in the form of IF-THEN rules. Contrast that with the neural network model, which uses a complex function of the input variables to estimate the fraud probability. Understanding exactly how a model makes its predictions is often important to business users; we found this to be the case with the asset protection personnel of this major retailer. In addition, there are potential legal issues: it may be that a retailer cannot deny service to a customer without a clear plain-English explanation, something that is not possible with a neural network model.

Model Assessment

The best way to assess the value of these data mining models is a profit matrix, a variant of a "confusion matrix" that details the expected benefit of using the model, broken down by the types of prediction errors that can be made. The classic confusion matrix is a simple 2x2 matrix that assesses the performance of a data mining model by examining the frequency of classification successes and errors. Ideally, this is done with a holdout test data set, one that has not been used or looked at in any way during the model creation phase. The data mining model calculates an estimate of the probability that the target variable, fraud in our case, is true. When using a decision tree model, all of the samples in a given leaf node of the resulting tree have the same predicted probability of fraud associated with them. When using the neural network model, each sample may have its own unique probability estimate. A business decision is then made to determine a cutoff probability.
Samples with probability higher than the cutoff are predicted fraudulent; samples below the cutoff are predicted non-fraudulent. Since we oversampled the data, there are actually two probabilities involved: the prior probability and the posterior probability of fraud. The prior represents the true proportion of fraud cases in the total population, a number often less than 1%. The posterior represents the proportion of fraud in the oversampled case set, which can be as much as 50%. After the user sets Enterprise Miner's prior probability of fraud for the target variable to reflect the true population probability, Enterprise Miner adjusts all output tables, trees, charts, graphs, etc. to show results as though no oversampling had occurred, scaling all output probabilities and counts to reflect how they would appear in the actual (prior) population. Enterprise Miner's ability to specify the prior probability of the target variable is a very beneficial feature for the user. For easy reference, here is the confusion matrix, in general:

            Predicted 0        Predicted 1
Actual 0    True negatives     False positives
Actual 1    False negatives    True positives

Figure 4. Confusion matrix.

The entries in the cells are usually counts. Ratios of various counts and/or sums of counts are often calculated to compute various figures of merit for the performance of the prediction/classification algorithm. Consider a very simple algorithm, requiring no data mining: simply decide that no case is fraudulent. This represents a baseline model against which to compare our data mining models. Here is the resulting confusion matrix:

            Predicted 0    Predicted 1
Actual 0    997,000        0
Actual 1    3,000          0

Figure 5. Confusion matrix for a model that always predicts no fraud.

This extremely simple algorithm would be correct 99.7% of the time! But no fraud would ever be detected; it has a hit rate of 0%. To improve on this, we must predict some fraud. Inevitably, doing so will increase the false positives as well.
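Enterprise Miner performs the prior-probability adjustment internally. For readers who want to see the mechanics, the standard correction for rescaling a probability estimated on an oversampled case set back to the population scale can be sketched as follows, using the paper's rates (0.3% population fraud, 12.5% sample fraud).

```python
# Standard prior-correction formula for oversampled training data.
# Enterprise Miner applies the equivalent adjustment internally when the
# prior probability of the target is set; this sketch just shows the math.
def adjust_for_oversampling(p_sample, prior_pop=0.003, prior_sample=0.125):
    """Rescale a fraud probability estimated on the oversampled case set
    back to the true (population) scale."""
    w1 = prior_pop / prior_sample              # reweight fraud cases down
    w0 = (1 - prior_pop) / (1 - prior_sample)  # reweight non-fraud cases up
    num = p_sample * w1
    return num / (num + (1 - p_sample) * w0)
```

A sanity check on the formula: a score equal to the sample fraud rate (0.125) maps back to exactly the population rate (0.003), and the adjustment preserves the ranking of scores, so the choice of cutoff is unaffected in rank terms even though the probability scale changes.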
Here is an example confusion matrix, for some assumed cutoff, showing sample counts for holdout test data:

            Predicted 0    Predicted 1
Actual 0    994,500        2,500
Actual 1    1,800          1,200

Figure 6. Confusion matrix for a mining model.

The choice of cutoff is a very important business decision. In reviewing the results of the study with this major retailer, it became clear that the decision of where to place the cutoff makes all the difference between a profitable and a not-so-profitable asset protection program. Let's examine the example confusion matrix presented above in more detail. Here are some summary statistics from this one table (note that positives = frauds):

Assessment Measure                                                     Value
Number of total samples                                                1,000,000
Number of actual frauds                                                3,000
Calculated percentage of fraud                                         0.3%
Accuracy = fraction classified correctly                               99.6%
Sensitivity = Recall = "hit rate"
  = true positives / actual positives                                  1,200/3,000 = 40%
Precision = positive predicted value (PV+)
  = true positives / predicted positives                               1,200/3,700 = 32.4%
Specificity = true negatives / actual negatives                        99.75%

Table 2. Assessment measures for the Figure 6 mining model.

Remarkably, even though the accuracy of the model is extremely good (the model classifies 99.6% of holdout case set samples correctly), the recall and precision are not nearly as good: 40% and 32.4%, respectively. This is a common situation when data mining for fraud detection, or for any low-probability event. As a business decision, the major retailer can decide to alter the probability threshold (cutoff) in the model, the point at which a sample is considered fraudulent vs. not. Using the very same decision tree or neural network, a different confusion matrix results. For example, if they increase the cutoff probability, they will have fewer hits (predict fewer frauds). The confusion matrix might look like this:

            Predicted 0    Predicted 1
Actual 0    996,850        150
Actual 1    2,400          600

Figure 7. Confusion matrix for a higher cutoff probability.
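The figures of merit in Table 2 all follow mechanically from the four confusion matrix counts. A minimal sketch, using the Figure 6 counts:

```python
# Recompute Table 2's figures of merit from the Figure 6 counts.
def confusion_metrics(tn, fp, fn, tp):
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,       # fraction classified correctly
        "sensitivity": tp / (tp + fn),       # recall, "hit rate"
        "precision": tp / (tp + fp),         # positive predicted value (PV+)
        "specificity": tn / (tn + fp),       # true negative rate
    }

m = confusion_metrics(tn=994_500, fp=2_500, fn=1_800, tp=1_200)
```

This reproduces the table: accuracy 99.57% (rounded to 99.6%), sensitivity 40%, precision 32.4%, specificity 99.75%.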
The hit rate, or sensitivity, is 600/3,000 = 20%, half that of the previous cutoff. However, the precision has improved from 32.4% to 80%. Fewer false positives means fewer customers getting angry because they have been falsely accused of fraudulent behavior. The expense of this benefit comes in the form of less fraud being caught.

Model Deployment

To make a proper determination of where to place the cutoff, the retailer needs to compare the costs of turning away good customers against the margin lost on goods stolen through genuine credit card fraud. A significant issue, which came up in discussions with the retailer, was determining the best way to deploy the fraud prediction. Because the Compaq ZLE solution makes a determination of fraud immediately, at the time of the transaction, various measures short of outright denial can be initiated when the data mining model predicts that a given transaction involves a fraudulent card. In other words, measures can be taken that discourage further fraudulent use of the card, but that will not be considered harmful by a customer who is not committing any fraud. Examples include asking to see another form of identification (if the credit card is being used in a brick-and-mortar venue), or asking for further reference information from the customer if it is an e-store transaction. Once a data mining model is built, the model output is converted to rules, and those rules are entered into the ZLE business rules engine inside the Interaction Manager, where they are mixed with other kinds of rules, such as policies. Note that decision tree results are already in essential rule form: IF-THEN statements that are a function of the structure of the nodes and leaves of the tree.
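The cutoff comparison above can be made concrete with an expected-cost calculation over the two confusion matrices. The $8 and $350 unit costs below are hypothetical and not from the study; they stand in for the cost of annoying a good customer (a false positive) and the margin lost to an undetected fraud (a false negative).

```python
# Illustrative only: the $8 and $350 unit costs are hypothetical, not from
# the study. They represent the cost of a false positive (annoyed good
# customer) and a false negative (margin lost to undetected fraud).
def expected_cost(fp, fn, cost_fp, cost_fn):
    return fp * cost_fp + fn * cost_fn

# Figure 6 (lower cutoff) vs. Figure 7 (higher cutoff) error counts
low_cutoff_cost = expected_cost(fp=2_500, fn=1_800, cost_fp=8.0, cost_fn=350.0)
high_cutoff_cost = expected_cost(fp=150, fn=2_400, cost_fp=8.0, cost_fn=350.0)
```

Under these assumed costs the lower cutoff wins, because missed fraud is so much more expensive than a false alarm; with different cost assumptions the conclusion can flip, which is exactly why the paper calls the cutoff a business decision.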
Neural network output can also be placed in the rules engine, by creating a calculation rule that applies the neural network to the requisite variables and generates a fraud/no-fraud prediction. For example, Java code performing the necessary calculations on the input variables could be generated by Enterprise Miner and inserted directly into the ZLE Interaction Manager.

5. Summary and Conclusions

In today's demanding business environment, customers expect current and complete information to be available continuously, and interactions of all kinds to be customized and appropriate. An organization must be able to disseminate new information instantaneously across the enterprise and use it to respond appropriately, in real time, to business events. In this paper, we described a ZLE solution architecture from Compaq and partner companies that directly addresses these challenges, and an implementation of this architecture for CRM in a retail environment. Data mining technology and this solution architecture are very synergistic. Data mining plays the key role in a ZLE solution of helping to understand business events and to determine the best ways to respond to them. The ZLE solution infrastructure, in turn, greatly facilitates mining by providing an integrated, data-rich environment. We described in detail a ZLE data mining application that uses SAS Enterprise Miner to detect retail credit card fraud. The current, comprehensive customer information available in a ZLE environment allows effective models to be built quickly in Enterprise Miner, and the ZLE environment allows these models to be deployed easily and used in real time to detect fraudulent credit card purchases. Data mining, done in the context of a ZLE solution, enables companies to respond quickly and effectively to business events.

Contact Information:
Philip R. Bosinoff: [email protected]
Michael L. Heytens: [email protected]

SAS and Enterprise Miner are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.