Data Mining In A Zero Latency Enterprise
May 30, 2001
Philip Bosinoff
Compaq Advanced Technology Center
Agenda
• ZLE Introduction
• Compaq ZLE Initiative
• Asset Protection Data Mining Application
• Summary and Conclusions
Zero Latency Enterprise
A definition:
“Zero Latency is the real-time, enterprise-wide dissemination of new information distributed in such a way that allows businesses to react quickly to it, driving the competitive business advantage to its ultimate limits”
  - Paul Larson, Talarian Corporation
The goal:
“Instantaneous awareness and appropriate response to events across an entire enterprise”
  - Roy Schulte, Vice President, World Services, Gartner Group
ZLE Retail CRM Example
• Making customer touchpoints
  - Integrated
  - Responsive
  - Personal
[Diagram: customer data flowing from Credit, POS, Gift, Refunds, eCommerce, and Call Center touchpoints into a single view of the customer, in real time]
ZLE Solution Challenges
• Synchronize information across various enterprise applications
• Integrate data for the enterprise
• Provide data to Business Intelligence applications for analysis
• Support operational applications needing access to up-to-the-minute data from multiple systems
[Diagram: a ZLE hub connecting Registration, Property, F&B, Ticketing, and eCommerce systems]
Agenda
• ZLE Introduction
• Compaq ZLE Initiative
• Asset Protection Data Mining Application
• Summary and Conclusions
The Compaq ZLE Program
• Business initiative to deliver complete solutions
• Partners
  - SAS, Mercator, Trillium, Acxiom, Actional, Blaze, MicroStrategy, Protagona, Savant
  - Cap Gemini, Lockheed, Deloitte, EDS, KPMG, …
• Initial industry targets
  - Retail (CRM)
  - Telecommunications
  - Finance
Compaq ZLE Architecture
[Diagram: applications connect through adapters to the Compaq ZLE core; clip-on applications, analysis components, and docking ISV offerings attach to the core services, which include an operational data store with data models feeding data mining and analysis]
ZLE Core Services
[Diagram: applications and clip-on applications use core services for transaction access, rules, transform, workflow, router/inserter, ad hoc queries, and ETL, layered on an application server/TP monitor (CORBA, Tuxedo, Java Frameworks), a cluster-aware DBMS, and an SSI clustered operating system]
Compaq ZLE Clip-on Services
• Services make use of the real-time data and application integration capabilities
• Examples
  - Interaction Manager (personalization): suggests appropriate customer responses
  - Guest/Customer Manager (profiling and data enrichment): synchronizes info across systems and enables a unified customer view
  - Campaign Management
  - Narrowcasting (notification)
[Diagram: Campaign Management, Interaction Manager, Guest Manager, and Notification services surrounding the ZLE core services, with detailed access to the ODS]
Proof point: An industry milestone
[Diagram: a Compaq NonStop™ Himalaya™ ODS (128 processors, 256 GB memory, 110 TB) holding 100 billion CDRs over a rolling 90 days, loading 1.2 billion CDRs per day and sustaining 1,000 queries/second; online and batch data synchronization with a credit warehouse and rating engines (Compaq ProLiant™, 4 processors each); batch extract and model deployment to a data mining warehouse (Compaq AlphaServer™, 8 processors, 1 TB) and a MicroStrategy jumbo parallel query cluster (ProLiant 2 x 2 cluster, 768 GB); monitoring by government compliance, status monitor, and customer service systems (ProLiant, 4 processors each)]
Role of Data Mining in ZLE
• Determine most effective responses to business events
• ZLE architecture facilitates mining by
  - Providing a rich, integrated, current data source
  - Integrating operational systems and business processes
Agenda
• ZLE Introduction
• Compaq ZLE Initiative
• Asset Protection Data Mining Application
• Summary and Conclusions
Background
• ZLE asset protection study conducted
  - SAS and Compaq analytical consultants
  - Large click-and-mortar US retailer (> $30B)
  - Sensitive topic, confidential results
• Credit-card fraud application described
  - Representative of actual study
  - Demonstrated at Compaq booth
• The opportunity
  - Use SAS Enterprise Miner to build predictive models
  - Identify fraudulent private-label card purchases in real time
  - Potentially large ROI (e.g., 0.1% = $30M)
Credit Card Fraud Methods
• Cards stolen or numbers generated
• Cards used quickly for multiple purchases at multiple stores
• Typical items purchased
  - Consumer electronics
  - Jewelry
  - Videos and CDs
• High purchase amount, high minimum item price
• Suspicious refund activity
Source Data For Modeling
• Fraud rate of 0.3% (0.25% - 2% is typical)
• Stratified sampling used
  - All fraud cases
  - Random sample of non-fraud cases
• Case set variables
  - Current purchase information
  - Historical purchase/refund measures
  - Account parameters
  - Fraud flag (target)
Modeling in Enterprise Miner
• Nodes are linked into a process flow diagram
• Processing steps
  - Input data set defined
  - Data partitioned: train, validate, test
  - Tree and neural network models built
  - Model performance assessed
Tree Node Output
• Example rule:
  IF CardsIssued30days AND
     StorePurch7Days > 3 AND
     StoresVisited1Day > 2 AND
     ElectronicsPurch3Day > 2 AND
     PurchAmt > $75
  THEN Probability of fraud is High
Measuring Model Effectiveness

                     Predicted
                     Not Fraud   Fraud
  Actual Not Fraud   a           b
  Actual Fraud       c           d

• Precision = d/(b+d)
• Recall = d/(c+d)
• Specificity = a/(a+b)
• False positive = b/(a+b)
• False negative = c/(c+d)
• Accuracy = (a+d)/(a+b+c+d)

Precision: How many of our predicted frauds are actual?
Recall: How many of the actual frauds do we catch?
Enterprise Miner Threshold-Based Charts
• Easy way to visualize confusion matrices
• Classification threshold can be set interactively
• Generated report shows confusion matrices for a set of thresholds
Setting The Classification Threshold
• Can trade off recall for precision:

  Threshold = 40 (Precision = .50, Recall = .32)
                      Predicted
                      No Fraud   Fraud
    Actual No Fraud   18907      832
    Actual Fraud      1770       840

  Threshold = 45 (Precision = .63, Recall = .12)
                      Predicted
                      No Fraud   Fraud
    Actual No Fraud   19558      181
    Actual Fraud      2300       310

• Less fraud caught, but fewer false positives
Model Deployment
• Threshold(s) and business processes determined
• Tree model converted to rules
  - Manual now, automated support coming
• Rules executed in ZLE Interaction Manager
  - In real time
  - Using integrated, detailed, current data
  - May be mixed with other business rules
  - Performance monitored
Summary and Conclusions
• A ZLE system enables
  - Rapid dissemination and integration of information
  - Real-time responses to business events
• Data mining used to determine effective responses
• ZLE and data mining very synergistic
• ZLE solutions available from Compaq, SAS and partners
Data Mining in a Zero Latency Enterprise
Philip Bosinoff and Michael Heytens
Compaq Computer Corp.
Abstract
Customers expect companies to provide current and complete information around-the-clock, and
interactions to be personalized, whether face-to-face, over the phone or on the Internet. A Zero
Latency Enterprise (ZLE) solution from Compaq and several partners directly addresses this
challenge by enabling the instantaneous dissemination of new information across an enterprise, and
using this information to respond to business events in real-time in an effective and customized
manner. In this paper we discuss the architecture of this ZLE solution. Then we describe in detail
a ZLE data mining application using SAS Enterprise Miner to detect retail credit card fraud.
This application is based on a fraud detection study done with a large U.S. retailer (greater than
US$30 billion in revenue) and SAS analytical consultants. The current, comprehensive customer
information available in a ZLE environment enables effective models to be built quickly in
Enterprise Miner. The ZLE environment allows these models to be deployed easily into a
business rules engine and executed against up-to-the-second information to detect fraudulent credit
card purchases in real-time. Data mining, done in the context of a ZLE solution, enables
companies to respond quickly and effectively to business events.
1. Introduction
The Internet and e-commerce have transformed marketplace expectations. People want accurate, up-to-the-second information and appropriate, instantaneous responses. Yet they also expect the same kind of
personalized attention they get from face-to-face interactions. Companies doing business on the Internet
must bring the entire enterprise up to Internet speed and find a way to personalize the e-commerce
experience. This must be done in an increasingly competitive business environment, where the ability to
react quickly to fast-changing market conditions while minimizing development and operations costs is
required to compete effectively.
To meet these challenges effectively, businesses must become what the Gartner Group has called a Zero
Latency Enterprise (ZLE), which is an organization that is able to make new information available
instantaneously across the enterprise and to use it to respond to business events in real-time and in an
appropriate and customized way.
Zero Latency Enterprise Solutions from Compaq and its partners directly address these challenges. The
Compaq ZLE solution architecture provides these capabilities by integrating applications and data through
a combination of EAI (Enterprise Application Integration) and ODS (Operational Data Store) technologies.
We describe this architecture, and the motivating principles behind it, in the next section of the paper.
The ZLE architecture has been applied successfully in a variety of industries, such as telecommunications,
finance, and retail. In Section 3, we describe an implementation of this architecture for a retail
environment, developed in collaboration with a large U.S. retailer (greater than US$30 billion in annual
revenue). This implementation allows the retailer to have a single, current, integrated, enterprise-wide
view of the customer, and to use this view to respond to customer interactions in real-time in an effective
and personalized way.
Many companies have partnered with Compaq in the development and delivery of ZLE solutions. One such
partner is SAS. ZLE solutions utilize SAS Enterprise Miner for data mining, and other SAS products
and services in various ways. SAS Enterprise Miner, and data mining in general, play a critical role in
ZLE solutions by helping to understand and to determine the responses to business events that are the most
appropriate and likely to be the most effective. A ZLE environment, in turn, greatly facilitates data mining
by providing a rich, integrated data source, and a platform through which mining results, such as predictive
models, can be deployed quickly and flexibly.
There are many possible applications of data mining in a ZLE solution: personalizing offers at the e-store
and other touchpoints; asset protection; campaign management; and real-time risk assessment. In Section
4, we describe one such application in detail: real-time credit card fraud detection in a retail environment.
This application was developed in partnership with asset protection professionals from the large U.S.
retailer mentioned previously. The paper ends in Section 5 with a summary and conclusions.
2. Compaq ZLE Solutions
Traditional approaches to application and data integration
There are two key capabilities inherent in a ZLE information system that pose significant technical
challenges. The first key capability is integrating and instantaneously disseminating new information
across an enterprise. The ability to do this successfully requires enterprise application integration (EAI),
which is typically accomplished by solutions that “push” information from a source application to various
targets in response to a business event. For example, if an order request were submitted to a Web-based
order management system, EAI could be used to forward the request to a backend fulfillment system.
To enable this kind of integration, an EAI solution must include adapters that allow various applications to
plug into the EAI communications infrastructure, then business rules for specifying the logic associated
with application-to-application interactions, and finally workflow management for managing the flow of
interactions across application systems. Data transformation technologies are also required at various
points in an EAI solution for mapping information from the structure and semantics of one application to
another.
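To make this pattern concrete, the sketch below shows the adapter/rule/workflow pipeline just described, assuming a Java-based integration layer. It is illustrative only: the type and method names are hypothetical and do not correspond to any actual EAI product API.

import java.util.List;
import java.util.Map;

// Hypothetical sketch of the EAI pattern described above: an event from a
// source application is checked against routing rules, transformed to the
// target's schema, and pushed through the target's adapter.
interface Adapter {
    void deliver(Map<String, Object> event);   // plugs a target application into the bus
}

interface TransformRule {
    boolean applies(Map<String, Object> event);                // business rule for one interaction
    Map<String, Object> transform(Map<String, Object> event);  // map source schema/semantics to target
}

class EventRouter {
    private final List<TransformRule> rules;
    private final Map<String, Adapter> targets;   // e.g., "fulfillment" -> adapter

    EventRouter(List<TransformRule> rules, Map<String, Adapter> targets) {
        this.rules = rules;
        this.targets = targets;
    }

    // Push one business event (e.g., a web order) to every target whose rule fires.
    void publish(Map<String, Object> event) {
        for (TransformRule rule : rules) {
            if (rule.applies(event)) {
                Map<String, Object> mapped = rule.transform(event);
                Adapter target = targets.get((String) mapped.get("target"));
                if (target != null) {
                    target.deliver(mapped);   // forward, e.g., to the backend fulfillment system
                }
            }
        }
    }
}

A routing rule here plays the role of the business logic for a single application-to-application interaction: it decides whether an event is relevant, maps it to the target's structure, and the router delivers it through the target's adapter.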
The second key ZLE capability is the ability to use the disseminated and integrated information to respond
in real-time to business events. This capability has traditionally been implemented via caching data from
across the enterprise in a repository such as an operational data store (ODS), from which applications and
users can then extract information to meet business needs. An ODS is fed in near real-time by other
databases and applications in the enterprise, and it contains summarized data, like a data warehouse, as well
as the more detailed transactional data generated by operational systems.
Limitations of traditional integration approaches
Both EAI and ODS technologies have been applied in the past in isolation to provide application and data
integration, respectively. While traditional uses of these technologies have certainly been beneficial, they
fall far short of meeting the ZLE challenge described earlier. EAI technologies, for example, aren’t
designed to be available 24-by-7, or to handle high-volume event environments, such as the hundreds, or
even thousands of events per second in retail point-of-sale (POS) and e-store clickstream applications.
Also, EAI solutions typically provide very inefficient mechanisms for retrieving information from across
an enterprise, a key required capability in a ZLE solution. Finally, EAI by itself has no persistent long-term
storage, another required ZLE capability.
An ODS solution also falls short of meeting the ZLE challenge, for several reasons. First, it provides only
data integration and does not address the application integration issue at all. Second, once written to the
ODS, data is typically not updateable, an important capability for maintaining an integrated and consistent
view of the enterprise. Finally, while an ODS is more operationally focused than, say, a data warehouse,
the data in an ODS is usually not detailed enough to provide actual operational support for many enterprise
applications.
ZLE solution architecture
The architectural approach taken in Compaq ZLE solutions is to combine EAI and ODS technologies,
retaining the benefits of each, and using the two in combination to address the shortcomings discussed
above. This approach is shown in Figure 1 below.
Figure 1. Compaq Zero Latency Enterprise Solutions combine ODS and
EAI technologies, with the NonStop™ solutions integrator at the hub.
The EAI layer, in the form of the NonStop™ solutions integrator, includes adapters that support a variety of
application-to-application communications schemes, including messages, transactions, objects, and
database access. The ODS layer contains a cache of data from across the enterprise, which is updated
directly and in near real-time by application systems, or indirectly through the EAI layer.
In addition to the EAI and ODS hub of a Compaq ZLE solution shown in Figure 1 above, there are several
other key architectural components. One is a set of analysis marts, for doing data mining and other forms
of business intelligence, such as OLAP. These marts are fed data from the ODS, and the results of any
analysis performed in these marts are deployed back into the ZLE hub for use in operational systems.
Another important component, for customer-focused ZLE solutions, is the Customer Manager. This
component is responsible for maintaining a single, enriched and enterprise-wide view of the customer. The
tasks performed by the Customer Manager include: de-duplication of customer information (e.g.,
recognizing duplicate customer information resulting from minor spelling differences), propagating
changes to customer information to the ODS and all affected applications, and enriching in-house data with
third-party demographics, psychographics and other kinds of information.
Another architectural component for customer-focused ZLE solutions is the Interaction Manager. This
component is responsible for recommending appropriate responses to customer interactions, based on the
information maintained by the Customer Manager and other information in the ODS. The Interaction
Manager is often the vehicle through which data mining results, such as predictive models, are deployed.
The Interaction Manager is designed to support a flexible and expressive set of business rules.
Data mining in a ZLE environment
Data mining techniques and the ZLE solution architecture described above are very synergistic, in the sense
that data mining plays a key role in the overall solution, and the ZLE solution infrastructure, in turn, greatly
facilitates data mining. Data mining’s role is to help identify and understand the most effective ways to
respond to business events based on historical data. For example, an e-store clickstream can be analyzed,
and the factors (navigation, previous purchase patterns, etc.) associated with visitors that tend to buy certain
kinds of products identified. These factors can then be used in the Interaction Manager to determine the
most appropriate ads, offers and content to display to future e-store visitors.
In this manner, data mining directly supports the essence of a ZLE---responding in an appropriate and
customized way to business events, based on integrated and current data from across an enterprise.
The Compaq ZLE solution architecture greatly facilitates data mining by (1) performing much of the data
preparation work for mining and (2) integrating many of the business processes and operational systems
required for the effective deployment of mining results. For example, the deployment of a model that
predicts, say, whether or not a customer will respond to an e-store offer, may require gathering customer
attributes such as demographics, purchase history, browse history and so on, from a variety of systems. In
a ZLE environment, this task is greatly simplified, because all this information is in the ODS in an
integrated and current form.
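As a sketch of what that single-source gathering might look like (hypothetical Java/JDBC, with an illustrative connection URL and table and column names; the actual ODS schema is not described in this paper):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical sketch: pull the integrated customer attributes a predictive
// model needs from the ODS in one query, instead of calling several systems.
public class OdsCustomerLookup {
    public static void main(String[] args) throws SQLException {
        String customerId = args[0];
        try (Connection conn = DriverManager.getConnection("jdbc:example:ZLE_ODS");
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT c.age_band, c.income_band, "            // demographics (third-party enriched)
               + "       p.purchases_90d, p.total_spend_90d, "   // purchase history
               + "       w.pages_viewed_today "                  // current browse activity
               + "FROM customer c "
               + "JOIN purchase_summary p ON p.cust_id = c.cust_id "
               + "JOIN web_session_summary w ON w.cust_id = c.cust_id "
               + "WHERE c.cust_id = ?")) {
            stmt.setString(1, customerId);
            try (ResultSet rs = stmt.executeQuery()) {
                if (rs.next()) {
                    // These values would be passed to the deployed model for scoring.
                    System.out.println("purchases_90d = " + rs.getInt("purchases_90d"));
                }
            }
        }
    }
}

The point of the sketch is the single query: in a ZLE environment, one join against the ODS replaces separate extractions from the credit, POS, and web systems.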
3. A ZLE Solution for Retail CRM
In the previous section, we saw the Compaq ZLE solution architecture. In this section, we discuss the
application of a ZLE solution to customer relationship management (CRM) in the retail industry, then
describe an actual implementation of this architecture developed in partnership with a large U.S. retailer.
New challenges for retailers
Neighborhood storeowners know their customers. When customers visit the store, the owner suggests
products likely to appeal. This kind of personalized service results in great customer loyalty, a cornerstone
of every retailer’s success. On the Internet and in today’s large retail chains, though, maintaining customer
loyalty through personalized service is much more challenging. In these environments, building a deep
understanding of customer preferences and needs is difficult, because the interactions that provide this
information are scattered across separate systems for sales, marketing, service, returns, credit cards, and so
on. Also, customers have many choices and can easily shop elsewhere.
To keep customers coming back, retailers of all sizes need to find a way to recapture the personal touch.
They need comprehensive knowledge of the customer that encompasses the customer’s entire relationship
with the retailer. Equally important is the ability to act on that customer knowledge instantaneously—for
example, by making personalized offers during every customer interaction, no matter how brief.
Obtaining a complete customer view
A key element of interacting with customers in a personalized way is having available a single, complete,
current, enterprise-wide view of the customer. This kind of customer view is simply not present in most
retail environments today. Retailers typically have a very fragmented view of customers resulting from the
separate and often incompatible computer systems for gift registry, credit card, returns, POS, e-store, and
so on. So, for example, if a customer attempts to return an item a few days after the return period, the
refund-desk representative that handles the return likely has no way of knowing if the customer is loyal and
profitable, thus warranting some leniency. Similarly, if a customer has just purchased an item, the
marketing department is not made aware that the customer should not be sent discount offers for that item
in the future.
A ZLE solution integrates all customer information from all channels, enabling retailers to make effective,
personalized offers at every customer touchpoint: the brick-and-mortar store, call center, or online e-store.
For example, an e-store customer who has just purchased gardening supplies at the brick-and-mortar store
can be offered complementary outdoor products the next time she visits the website.
A ZLE retail implementation
The retail ZLE implementation developed by Compaq, in partnership with a large retail customer and
various partner companies, consists of a framework with many components. These components can be
assembled, based on customer requirements and preferences, into a retail ZLE solution. The major
components in this solution framework are listed in Table 1 below.
• Operational data store: NonStop™ Himalaya™ servers with the NonStop™ SQL database, or AlphaServer systems with the Oracle 8i™ database
• Application integration: Mercator Business Broker or Compaq BusinessBus software
• Integration of disparate databases and systems: Common Object Request Broker Architecture (CORBA) via NonStop™ DOM/MP and IBM MQSeries software
• Integration of external demographic data: Acxiom InfoBase software
• Customer consolidation and de-duplication: Harte-Hanks Trillium or Acxiom AbiliTec software
• Business rules definition and execution: Blaze Advisor Solutions Suite software
• Call center: Siebel Call Center software
• Customer campaign management: RSI Protagona software
• Data mining: SAS Enterprise Miner software on Compaq AlphaServer™ systems
• Broadcast of alerts: MicroStrategy Broadcaster software
• E-store front end: Microsoft Site Server Commerce Edition software and Fast Start services from Compaq

Table 1. Functions and components for a retail ZLE solution.
Several elements of the ZLE solution architecture were described in the previous section. Let’s look at
each of these elements and how they relate to the functions and components listed in Table 1.
The core ODS and EAI architectural components are implemented by a Compaq Himalaya™ server with
the NonStop™ SQL database or an AlphaServer system with Oracle 8i™ (ODS), and by Mercator’s
Business Broker or Compaq’s BusinessBus (EAI). Additional integration is achieved through the use of
CORBA technology and IBM’s MQSeries software.
The Customer Manager, which maintains in the ODS a complete and current customer view, utilizes
Acxiom’s InfoBase software to enrich internal customer information with demographics and
psychographics. Consolidation and de-duplication of customer data is achieved via either Harte-Hanks’s
Trillium or Acxiom’s AbiliTec software.
The Interaction Manager uses the Blaze Advisor Solutions Suite software, which includes a Java-based
rules engine, for the definition and execution of business rules. The Interaction Manager suggests
appropriate responses to e-store visitor clicks, calls to the call center, point-of-sale purchases, refunds, and
a variety of other interactions across a retail enterprise.
Data mining analysis is performed via SAS Enterprise Miner running on a Compaq AlphaServer™
system. Source data for mining analysis is extracted from the ODS and moved to the mining platform. The
results of any mining analysis, such as predictive models, are deployed into the rules engine inside the
Interaction Manager or directly into ZLE applications. The ability to mix patterns discovered by
sophisticated mining analyses with business rules-of-thumb, policies, etc. inside the Interaction Manager is
very powerful.
There are lots of potential applications of data mining in a ZLE retail environment: e-store cross-sell and
up-sell; real-time fraud detection, both in physical stores and e-stores; campaign management; and making
personalized offers at all touchpoints. In the next section, we will take an in-depth look at one of these
applications, real-time fraud detection.
4. An Asset Protection Data Mining Application
Compaq and SAS consultants conducted an in-depth study with a large U.S. retailer of how to apply data
mining technology to the problem of detecting check fraud. Due to the sensitive nature of the check-fraud
study, the information that we can reveal about it is very limited. However, we have created a ZLE credit
card fraud demonstration, based on the check fraud study, that does not disclose any confidential
information. We will describe this credit card fraud demonstration in this paper.
The Opportunity
Data mining techniques provide an opportunity to detect fraud in the use of company-issued credit cards – fraud which otherwise would go undetected at the time of infraction. A strong business case exists to add ZLE-based data mining to a retailer’s asset protection program. Even though typical retail credit card fraud rates are relatively small – in the 0.25% to 2% range – for a large retailer, even a small reduction in fraud translates to millions of dollars saved per year.
We expect that most modern retailers do use some type of empirically-derived rules, or even predictive
mining models as part of their asset protection program. In either case, predictions are probably made
based on a very narrow customer view. The ZLE advantage is that models trained on current and
comprehensive customer information can utilize up-to-the-second information to make real-time
predictions.
In this paper, we are discussing credit cards owned by the retailer (i.e., in-house credit cards branded with the retailer’s name), not cards issued by a third party or bank. In this case, the retailer has payment history and purchase history information for the consumer – an ideal situation for data mining with ZLE.
Source Data
As discussed above, in Sections 2 and 3, all source data is contained in the ODS. As such, much of the data
preparation phase of standard data mining has already been accomplished. The cleaned, disparately sourced,
de-duped, demographically enriched data is ready to mine.
Successful data mining for fraud detection requires the creation of a case set with carefully chosen
variables, and derived variables. Note that we use the term variable to mean the same as attribute, column,
or field.
Each row in the demo data set describes the status of one credit card account. Each row can be thought of
as a case, and the goal of the data mining exercise is to find patterns that differentiate the fraud and non-fraud cases. The demo data set is referred to as a case set.
Credit card fraud rates are typically in the range of about .25% to 2%. For model building, it is important
to boost the percentage of fraud in the case set so that the ratio of fraud to non-fraud cases is higher – as much as 50%. The reason is that if there are relatively few cases of fraud in the training set, the model-building algorithms will have difficulty finding patterns in the data.
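A minimal sketch of this oversampling step follows (hypothetical Java; in practice it would typically be done with SQL against the ODS or with a sampling node in Enterprise Miner): keep every fraud case, and keep each non-fraud case with a small fixed probability.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the stratified sampling described above: keep every
// fraud case, and keep each non-fraud case with a small fixed probability so
// the fraud-to-non-fraud ratio in the case set becomes workable for training.
public class CaseSetSampler {
    // rows: raw case rows; targetIdx: index of the fraud flag (1.0 = fraud).
    public static List<double[]> sample(List<double[]> rows, int targetIdx,
                                        double nonFraudKeepRate, long seed) {
        Random rng = new Random(seed);                 // seeded for repeatability
        List<double[]> caseSet = new ArrayList<>();
        for (double[] row : rows) {
            boolean isFraud = row[targetIdx] == 1.0;
            if (isFraud || rng.nextDouble() < nonFraudKeepRate) {
                caseSet.add(row);
            }
        }
        return caseSet;
    }
}

For the demonstration case set described below (all 125K fraud cases plus roughly 875K non-fraud cases sampled from about 40M records), the non-fraud keep rate would be on the order of 0.02.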
The data set used in the eCRM ZLE demonstration contains approximately 1 million sample records, with
each record describing the purchase activity of a customer on a company credit card. For the purposes of
this paper, each row in the case set represents aggregate customer account activity over some reasonable
time period such that it makes sense for this account to be classified as fraudulent or non-fraudulent. This
was done for convenience, given the customer-centric view used to demonstrate the ZLE environment. Real-world case sets would more typically have one row per transaction, each row being
identified as a fraudulent or non-fraudulent transaction. The number of fraud cases, or records, is
approximately 125K, which translates to a fraudulent account rate of about .3% (125K out of the 40M
guests in the complete eCRM demo database). Note how low this rate is: much less than 1%. All 125K
fraud cases (i.e., customers for which credit-card fraud occurred) are in the case set, along with a sample of
approximately 875K non-fraud cases. Both the true fraud rate (.3%) and the ratio of non-fraud to fraud
cases (roughly 7 to 1) in the case set are typical of what is found in real fraud detection studies. The demo
data set is a synthetic one, in which we planted several patterns (described in detail below) associated with
fraudulent credit card purchases.
We accounted for the difference between the true population fraud rate of 0.3% and the sample fraud rate
of 12.5% by using the Prior probability feature of Enterprise Miner - a feature expressly designed for this
purpose. Enterprise Miner (EM) allows the user to set the true population probability of the rare target
event. EM then automatically takes this into consideration in all model assessment calculations. This is
discussed in more detail below in the model deployment section of the paper.
The demonstration case set contained the following fields:
• RAC30: number of cards reissued in the last 30 days.
• TSPUR7: total number of store purchases in the last 7 days.
• TSRFN3: total number of store refunds in the last 3 days.
• TSRFNV1: total number of different stores visited for refunds in the last 1 day.
• TSPUR3: total number of store purchases in the last 3 days.
• NSPD83: normalized measure of store purchases in department 8 (electronics) over the last 3 days. This variable is normalized in the sense that it is the number of purchases in department 8 in the last 3 days, divided by the number of purchases in the same department over the last 60 days.
• TSAMT7: total dollar amount spent in stores in the last 7 days.
• FRAUDFLAG: target variable.
The first seven are independent variables (i.e., the information that will be used to make a fraud prediction),
and the eighth is the dependent or target variable (i.e., the outcome being predicted).
Note that building the case set requires access to detailed, transaction-level data (e.g., to determine
NSPD83) and data from multiple customer touchpoints (RAC30, which would normally be stored in a
credit card system, and variables such as TSPUR7, that describe in-store POS activity, would be stored in a
different system). Also, the case set includes data up to the current day. The task of building an up-to-date
data set from multiple systems is facilitated greatly in a ZLE environment, but much more difficult in other
environments.
Note that RAC30, TSPUR7, TSRFN3, TSRFNV1, TSPUR3, NSPD83, and TSAMT7 are “derived”
variables. The ODS does not carry this information in exactly this form. These values were computed from other existing fields. An appropriate set of SQL queries is one common way to create the
case set.
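As a sketch of what such queries might look like (the table and column names below are hypothetical, since the actual ODS schema is not described in this paper, and date-arithmetic syntax varies by database), a few of the derived variables could be computed as follows:

// Hypothetical sketch: computing some of the derived case-set variables
// (RAC30, TSPUR7, TSAMT7) with SQL against the ODS, held here as a Java
// string constant. Table and column names are illustrative only.
public class CaseSetSql {
    static final String BUILD_CASE_SET =
        "INSERT INTO fraud_case_set (account_id, rac30, tspur7, tsamt7, fraudflag) "
      + "SELECT a.account_id, "
      + "       (SELECT COUNT(*) FROM card_reissue r "
      + "         WHERE r.account_id = a.account_id "
      + "           AND r.reissue_date >= CURRENT_DATE - 30) AS rac30, "   // RAC30
      + "       (SELECT COUNT(*) FROM pos_purchase p "
      + "         WHERE p.account_id = a.account_id "
      + "           AND p.purch_date >= CURRENT_DATE - 7) AS tspur7, "     // TSPUR7
      + "       (SELECT COALESCE(SUM(p.amount), 0) FROM pos_purchase p "
      + "         WHERE p.account_id = a.account_id "
      + "           AND p.purch_date >= CURRENT_DATE - 7) AS tsamt7, "     // TSAMT7
      + "       a.fraud_flag "
      + "FROM account a";
}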
Credit Card Fraud Methods
According to the asset protection professionals that participated in the study, one technique used to commit
fraud begins by stealing a newly issued credit card. For example, a store may send out a new card to a
customer and a thief may steal it out of the customer’s mailbox. The data set contains a variable that
describes whether or not cards have been reissued recently (RAC30).
Once a thief gets a stolen credit card, he or she typically uses it frequently in a short period of time, e.g., in
1-7 days, before the stolen card is reported and stops being accepted. The data set contains variables that
describe the total number of store purchases over the last 3 and 7 days, and the total amount spent in the
last 7 days.
Credit card thieves also tend to buy small expensive things, like consumer electronics. These are evidently
desirable either for personal use by the thief, or because they are easy to sell “on the street”. The variable
NSPD83 is a measure of the history of electronics purchases.
Finally, thieves sometimes return merchandise bought with a stolen credit card for a cash refund. One
technique for doing this is to use a fraudulent check to get a positive balance on a credit card, then items are
bought and returned. Because there is a positive balance on the card used to purchase the goods, a cash
refund is issued. (Seems like a questionable business practice to refund cash for something bought on a
credit card, but evidently some stores do this!) Thieves often refund merchandise at different stores in the
same city, to lower the chance of being caught. The data set contains several measures of refund activity.
To summarize, the purchase patterns associated with a stolen credit card are: lots of purchases in a short
period of time, high total dollar amount, cards recently reissued, purchases of electronics, suspicious refund
activity, and so on. These are some of the patterns that the models built in the demonstration will detect.
Modeling
SAS Enterprise Miner supports a visual programming model, where nodes, which represent various
processing steps, are connected together into process flows. The demonstration process flow diagram
contains the following nodes:
Figure 2. Demonstration process flow diagram.
Note that Enterprise Miner version 3 was used in this demonstration. Version 4 has since been released,
but the principles are the same. The goal of the analysis is to build a model that predicts credit card fraud.
The Enterprise Miner interface allows for quick model creation, and easy comparison of model
performance. Here is an example of the lower right portion of the output from the Tree node:
Figure 3. Tree node output.
The various paths through the tree, and the IF-THEN rules associated with them, describe the fraud patterns
associated with credit card fraud. One interesting path through the tree is:
If Cards reissued in last 30 days and
Total store purchases in last 7 days > 1 and
Number of different stores visited for refunds in current day > 1 and
Normalized number of purchases in electronics dept. in last 3 days > 2 Then
Probability of fraud is HIGH
As described above, the conditions in this rule identify some of the telltale signs of credit card fraud,
resulting in a prediction of fraud with high probability. The leaf node corresponding to this tree has a high
concentration of fraud (approximately 80% fraud cases, 20% non-fraud) in the training and validation sets.
(The first column of numbers shown on this and other nodes in the tree describes the training set, and the
second column the validation set.) Note that the leaf nodes are color coded, with red nodes containing
relatively little fraud, and green the most.
Another rule in the tree is:
If Cards reissued in last 30 days and
Total store purchases in last 7 days > 1 and
Number of different stores visited for refunds in current day > 1 and
Normalized number of purchases in electronics dept. in last 3 days <= 2 and
Total amount of store purchases in last 7 days >= 700 Then
Probability of fraud is HIGH
This is similar to the previous rule, except that fewer electronics items are purchased, but the total dollar
amount purchased in the last 7 days is relatively large (at least $700).
An alternative data mining model, produced by the neural network node in Enterprise Miner, gives very
comparable results. In fact, the relative performance of these two classic data mining tools was very
similar – even though the approaches are completely different. It is possible that tweaking the parameters
of the neural network model might have given us a more powerful tool for fraud prediction, but this was
not done during this study.
Prediction logic is apparent and easily understood, in the form of IF-THEN rules, in the decision tree
model. Contrast that with the neural network model, which basically uses a complex function of the input
variables to estimate the fraud probability. Understanding exactly how a model is making its predictions is
often important to business users. We found this to be the case with the asset protection personnel of this
major retailer. In addition, there are potential legal issues – it may be that a retailer cannot deny service to
a customer without a clear English explanation – something that is not possible with a neural network
model.
Model Assessment
The best way to assess the value of these data mining models is a profit matrix, a variant of a “confusion
matrix” which details the expected benefit of using the model, as broken down by the types of prediction
errors that can be made. The classic confusion matrix is a simple 2x2 matrix assessing the performance of
the data mining model by examining the frequency of classification successes/errors. Ideally, this is done
with a holdout test data set, one that has not been used or looked at in any way during the model creation
phase. The data mining model calculates an estimate of the probability that the target variable, fraud in our
case, is true. When using a decision tree model, all of the samples in a given node of the resulting tree have
the same predicted probability of fraud associated with them. When using the neural network model, each
sample may have its own unique probability estimate. A business decision is then made to determine a
cutoff probability. Samples with probability higher than the cutoff are predicted fraudulent, samples below
the cutoff are predicted as non-fraudulent.
Since we oversampled the data, there are actually two probabilities involved: the prior probability and the
posterior probability of fraud. The prior represents the true proportion of fraud cases in the total population
– a number often less than 1%. The posterior represents the proportion of fraud in the oversampled case set
– as much as 50%. After setting up Enterprise Miner’s prior probability of fraud for the target variable to
reflect the true population probability, Enterprise Miner adjusts all output tables, trees, charts, graphs, etc.
to show results as though no oversampling had occurred – scaling all output probabilities and counts to
reflect how they would appear in the actual (prior) population. Enterprise Miner’s ability to specify the
Prior probability of the target variable is a very beneficial feature for the user.
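For reference, the standard adjustment involved can be written out explicitly (Enterprise Miner applies an equivalent correction internally). With true prior $\pi = 0.003$, case-set fraud rate $\rho = 0.125$, and $\hat{p}$ the posterior probability a model estimates on the oversampled case set, the probability rescaled to the actual population is

\[
p_{\mathrm{adj}} \;=\; \frac{\hat{p}\,(\pi/\rho)}{\hat{p}\,(\pi/\rho) \;+\; (1-\hat{p})\,\dfrac{1-\pi}{1-\rho}}
\]

For example, a tree leaf that is 80% fraud in the case set ($\hat{p} = 0.8$) gives $p_{\mathrm{adj}} \approx 0.019/(0.019 + 0.228) \approx 0.08$, i.e., roughly an 8% fraud probability in the actual population.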
For easy reference, here is the confusion matrix, in general:

                  Predicted
                  0                 1
  Actual 0        True negatives    False positives
  Actual 1        False negatives   True positives

Figure 4. Confusion matrix.
The entries in the cells are usually counts. Ratios of various counts and/or sums of counts are often
calculated to compute various figures of merit for the performance of the prediction/classification
algorithm.
Consider a very simple algorithm, requiring no data mining – that of simply deciding that all cases are not
fraudulent. This represents a baseline model with which to compare our data mining models. Here is the
resulting confusion matrix:
                  Predicted
                  0          1
  Actual 0        997,000    0
  Actual 1        3,000      0

Figure 5. Confusion matrix for a model that always predicts no fraud.
This extremely simple algorithm would be correct 99.7% of the time! But no fraud would ever be detected.
It has a hit rate of 0%. To improve on this, we must predict some fraud. Inevitably, doing so will increase
the false positives as well.
Here is an example confusion matrix, for some assumed cutoff, showing sample counts for holdout test
data:
                  Predicted
                  0          1
  Actual 0        994,500    2,500
  Actual 1        1,800      1,200

Figure 6. Confusion matrix for a mining model.
The choice of cutoff is a very important business decision. In reviewing the results of the study with this
major retailer, it became extraordinarily clear that this decision as to where to place the cutoff makes all the
difference between a profitable and a not-so-profitable asset protection program.
Let’s examine the example confusion matrix presented above in more detail. Here are some summary statistics from this one table (note that positives = frauds):

  Assessment Measure                                                                Value
  Number of total samples                                                           1,000,000
  Number of actual frauds                                                           3,000
  Calculated percentage of fraud                                                    0.3%
  Accuracy = fraction classified correctly                                          99.6%
  Sensitivity = Recall = “hit rate” = true positives/actual positives               1200/3000 = 40%
  Precision = positive predicted value (PV+) = true positives/predicted positives   1200/3700 = 32.4%
  Specificity = true negatives/actual negatives                                     99.75%

Table 2. Assessment measures for Figure 6 mining model.
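These figures of merit are simple ratios of the four cell counts; the small self-contained sketch below (not tied to any Enterprise Miner output format) recomputes the Table 2 values from the Figure 6 matrix:

// Sketch: recompute the Table 2 figures of merit from the Figure 6 confusion
// matrix. tn/fp/fn/tp = true negatives, false positives, false negatives,
// true positives.
public class ConfusionMatrixStats {
    public static void main(String[] args) {
        long tn = 994_500, fp = 2_500, fn = 1_800, tp = 1_200;

        double accuracy    = (double) (tp + tn) / (tp + tn + fp + fn);
        double recall      = (double) tp / (tp + fn);   // sensitivity, "hit rate"
        double precision   = (double) tp / (tp + fp);   // positive predicted value (PV+)
        double specificity = (double) tn / (tn + fp);

        System.out.printf("accuracy=%.3f recall=%.3f precision=%.3f specificity=%.4f%n",
                          accuracy, recall, precision, specificity);
        // Prints: accuracy=0.996 recall=0.400 precision=0.324 specificity=0.9975
    }
}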
Remarkably, even though the accuracy of the model is extremely good – the model classifies 99.6% of
holdout case set samples correctly - the Recall and Precision are not nearly as good, 40% and 32%
respectively. This is a common situation when data mining for fraud detection, or any low probability
event situation.
As a business decision, the major retailer can decide to alter the probability threshold (cutoff) in the model
– that point at which a sample is considered fraudulent vs. not. Using the very same decision tree or neural
network, a different confusion matrix results. For example, if they increase the cutoff probability, they will
have fewer hits (predict fewer frauds). The confusion matrix might look like this:
                  Predicted
                  0          1
  Actual 0        996,850    150
  Actual 1        2,400      600

Figure 7. Confusion matrix for a higher cutoff probability.
The hit rate, or sensitivity, is 600/3000=20%, half as good as the previous cutoff. However, the precision
has improved from 32% to 80%. Fewer false positives means fewer customers getting angry because
they’ve falsely been accused of fraudulent behavior. The cost of this benefit is that less fraud is caught.
Model Deployment
To make a proper determination about where to place the cutoff, the retailer needs to compare costs
involved with turning away good customers to margin lost on goods stolen through genuine credit card
fraud. A significant issue, which came up in discussions with the retailer, was determining the best way to
deploy the fraud prediction. Since the Compaq ZLE solution makes a determination of fraud immediately
at the time of the transaction, if the data mining model predicts a given transaction is with a fraudulent card,
various measures to discourage the transaction can be initiated – without necessarily an outright denial. In other
words, measures need to be taken which discourage further fraudulent use of the card, but which will not
otherwise be considered harmful to the customer who is not committing any fraud whatsoever. Examples
of this might be asking to see another form of identification (if the credit card is being used in a brick-and-mortar venue), or asking for further reference information from the customer if it is an e-store transaction.
Once a data mining model is built, the model output is converted to rules. Those rules are entered into the
ZLE business rules engine inside the Interaction Manager. These rules are mixed with other kinds of rules,
such as policies, as well. Note that decision tree results are already in essentially rule form – if-then statements that follow from the structure of the leaves and nodes of the tree. Neural net output can also
be placed in the rules engine by simply creating a calculation rule which applies the neural network to the
requisite variables, generating a fraud/no fraud prediction. For example, Java code performing the
necessary calculations on the input variables could be generated by Enterprise Miner and inserted directly
into the ZLE Interaction Manager.
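As an illustration, here is what the first tree rule from this section might look like once converted to executable form (a hypothetical Java sketch in the spirit of a rule placed in the Interaction Manager; it is not Blaze Advisor syntax, and actual Enterprise Miner generated code would differ):

// Hypothetical sketch: a decision-tree leaf deployed as an executable rule.
// Parameter names mirror the case-set variables RAC30, TSPUR7, TSRFNV1, NSPD83.
public class FraudRules {
    public static boolean highFraudProbability(int cardsReissued30d,
                                               int storePurchases7d,
                                               int refundStoresToday,
                                               double electronicsNorm3d) {
        return cardsReissued30d > 0        // cards reissued in last 30 days
            && storePurchases7d > 1        // total store purchases in last 7 days > 1
            && refundStoresToday > 1       // stores visited for refunds today > 1
            && electronicsNorm3d > 2;      // normalized electronics purchases, last 3 days > 2
    }
}

Because the rule is plain if-then logic, it can be mixed in the rules engine with policies and other hand-written rules, exactly as described above.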
5. Summary and Conclusions
In today’s demanding business environment, customers expect current and complete information to be
available continuously, and interactions of all kinds to be customized and appropriate. An organization
must be able to disseminate new information instantaneously across the enterprise and use it to respond
appropriately and in real-time to business events.
In this paper, we described a ZLE solution architecture from Compaq and partner companies that directly
addresses these challenges, and an implementation of this architecture for CRM in a retail environment.
Data mining technology and this solution architecture are very synergistic. Data mining plays the key role
in a ZLE solution of helping to understand and to determine the best ways to respond to business events.
The ZLE solution infrastructure, in turn, greatly facilitates mining by providing an integrated, data-rich
environment.
We described in detail a ZLE data mining application that uses SAS Enterprise Miner to detect retail
credit card fraud. The current, comprehensive customer information available in a ZLE environment
allows effective models to be built quickly in Enterprise Miner. The ZLE environment allows these
models to be deployed easily and used in real-time to detect fraudulent credit card purchases.
Data mining, done in the context of a ZLE solution, enables companies to respond quickly and effectively
to business events.
Contact Information:
Philip R. Bosinoff: [email protected]
Michael L. Heytens: [email protected]
SAS and Enterprise Miner are registered trademarks or trademarks of SAS Institute Inc. in the USA and other
countries.  indicates USA registration. Other brand and product names are trademarks of their respective companies.