E-COMmerce Proficient Analytics in Security and Sales for SMEs
D1.1 – SME E-COMPASS METHODOLOGICAL FRAMEWORK
Contractual Delivery Date: M3 – March 2014
Actual Delivery Date: March 2014
Nature: Report
Version: 1.0
PUBLIC Deliverable
Abstract
This report comprehensively summarizes current approaches to fraud prevention and data mining
tools, both those reported in the scientific literature and those used in e-commerce practice
globally. It also illustrates the models and theories that will be implemented in the
project’s applications for the benefit of SME associations and their members.
© Copyright by the SME E-COMPASS consortium, 2014-2015
SME E-COMPASS is a project co-funded by the European Commission within the 7th Framework Programme.
For more information on SME E-COMPASS, please visit http://www.sme-ecompass.eu/
DISCLAIMER
This document contains material, which is the copyright of the SME E-COMPASS consortium members and the
European Commission, and may not be reproduced or copied without permission, except as mandated by the
European Commission Grant Agreement no 315637 for reviewing and dissemination purposes.
The information contained in this document is provided by the copyright holders "as is" and any express or implied
warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular
purpose are disclaimed. In no event shall the members of the SME E-COMPASS collaboration, including the
copyright holders, or the European Commission be liable for any direct, indirect, incidental, special, exemplary, or
consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data,
or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict
liability, or tort (including negligence or otherwise) arising in any way out of the use of the information contained in
this document, even if advised of the possibility of such damage.
SME E-COMPASS
D1.1 – SME E-COMPASS Methodological Framework– v.1.0
Table of Contents
Table of Contents
Table of Figures
Table of Tables
Terms and abbreviations
Executive Summary
1 Introduction
1.1 About this deliverable
1.2 Document structure
2 Definitions
2.1 Online Fraud
2.2 Data Mining and web Analytics for e-Sales Operations
2.3 Semantic Web
2.3.1 Linked Data
2.3.2 Ontologies
2.3.3 Web ontology languages
3 Analysis of online anti-fraud systems
3.1 Current Trends and Practices
3.1.1 Introduction
3.1.2 Manual order review
3.1.3 Data used in fraud detection
3.2 State-of-the-art technologies
3.2.1 Introduction
3.2.2 Expert systems
3.2.3 Supervised learning techniques
3.2.4 Anomaly detection technologies
3.2.5 Hybrid architectures
3.2.6 Semantic Web technologies and fraud detection
Grant Agreement 315637
PUBLIC
Page 2 of 144
3.3 Commercial products in place
3.3.1 Product: Accertify (an American Express product)
3.3.2 Product: Cardinalcommerce
3.3.3 Product: Identitymind
3.3.4 Product: Iovation
3.3.5 Product: Kount
3.3.6 Product: Lexisnexis
3.3.7 Product: Maxmind
3.3.8 Product: Subuno
3.3.9 Product: Braspag
3.3.10 Product: Fraud.net
3.3.11 Product: Volance
3.3.12 Product: Authorize.net by Cybersource.com (a Visa company)
3.3.13 Product: 41st Parameter
3.3.14 Product: Threatmetrix
3.3.15 Product: Digitalresolve
3.3.16 Product: Nudatasecurity
3.3.17 Product: Easysol
3.4 Research project results
3.5 Weaknesses and limitations of current practices compared to SME needs
3.5.1 Introduction
3.5.2 Lack of adaptivity
3.5.3 Lack of publicly available data/ joint actions
3.5.4 Scalability issues
3.5.5 Limitations in integrating heterogeneous data and information sources
3.5.6 Dealing with case imbalance and skewed class distributions
3.5.7 Difficulties in managing late- or false-labelled cases
3.5.8 Cost-efficiency concerns
3.5.9 Lack of transparency and interpretability
4 Analysis of data mining for e-sales
4.1 State-of-the-art technologies
4.1.1 Data gathering
4.1.1.1 Conversion information
4.1.1.2 User behaviour information
4.1.1.3 Competitor information
4.1.2 Data extraction and analysis
4.1.3 Automatized reaction to data analysis
4.1.4 Information presentation/visualization
4.2 Trends and practices for e-sales
4.3 Data mining techniques for e-sales
4.4 Trends & practices vs data mining techniques for e-sales
4.5 Commercial products in place
4.5.1 E-shop software
4.5.2 Price Search
4.5.3 Web analysis
4.5.4 Data mining suites
4.6 Open source data mining products in place
4.7 Trends & practices vs data mining techniques for e-sales vs data mining suites
4.8 Research project results and scientific literature
4.8.1 Research Projects
4.8.2 Scientific Literature
4.9 Weaknesses and limitations of current practices compared to SME needs
5 From Knowledge Harvesting to Designing E-COMPASS Methodological Framework
5.1 Technologies Pre-selection
5.1.1 Anti-fraud System
5.1.2 Data mining for e-Sales
5.1.3 Semantic web Integration
5.2 Objectives
5.2.1 Anti-Fraud System’s Objectives
5.2.2 Objectives – Online data mining
5.3 Integration Framework for the Design Process
6 APPENDIX
6.1 Web analytics techniques (for visitors behaviour analysis)
6.2 Metrics for customer behaviour analysis
6.3 A classification of empirical studies employing state-of-the art fraud detection technologies
7 References
Table of Figures
FIGURE 1: NUMBER OF GLOBAL E-COMMERCE TRANSACTIONS (BILLION), 2010–2014F
FIGURE 2: B2C E-COMMERCE REVENUE WORLDWIDE IN 2011 AND 2012 AND THE FORECASTS UNTIL 2016 (IN BILLION US-DOLLAR) (EMARKETER, 2013A)
FIGURE 3: B2C E-COMMERCE REVENUE IN EUROPE IN 2011 AND 2012 AND FORECASTS UNTIL 2016 (IN BILLION US-DOLLAR) (EMARKETER, 2013B)
FIGURE 4: B2C E-COMMERCE REVENUE DEPENDING ON CERTAIN REGIONS OF THE WORLD IN 2012 AND FORECASTS UNTIL 2016 (IN BILLION US-DOLLAR) (EMARKETER, 2013A)
FIGURE 5: SHARE OF ONLINE BUYERS OF THE WHOLE POPULATION IN GERMANY FROM 2000 TO 2013 (INSTITUT FÜR DEMOSKOPIE ALLENSBACH, 2013)
FIGURE 6: SHARE OF ONLINE PURCHASES IN COMPARISON TO THE OVERALL PURCHASES PER AGE GROUP IN GERMANY IN 2012 (BUNDESVERBAND DIGITALE WIRTSCHAFT (BVDW) E.V., 2012)
FIGURE 7: TOP 20 PRODUCT GROUPS IN E-COMMERCE DEPENDING ON REVENUE IN GERMANY IN 2012 (IN MILLION EURO) (BVH, 2013B)
FIGURE 8: VISITOR NUMBERS OF THE LARGEST E-SHOPS IN GERMANY IN JUNE 2013 (IN MILLION) (LEBENSMITTELZEITUNG.NET, 2013)
FIGURE 9: REVENUE SHARE OF THE TOP10, TOP100 AND TOP500 E-SHOPS OF THE WHOLE MARKET IN GERMANY IN 2012 (EHI RETAIL INSTITUTE, STATISTA, 2013)
FIGURE 10: THE SEMANTIC WEB TOWER
FIGURE 11: BUSINESS VISION AND E-MARKETING
FIGURE 12: WHICH MARKETING ACTIVITIES DO YOU CONDUCT IN ORDER TO ATTRACT VISITORS TO YOUR E-SHOP (BAUER ET AL., 2011)
FIGURE 13: WHY DON'T YOU USE A WEB ANALYTICS TOOL? (BAUER ET AL., 2011)
FIGURE 14: A SCHEMATIC DESCRIPTION OF THE ANTI-FRAUD SYSTEM FUNCTIONALITIES AND ARCHITECTURE
FIGURE 15: THE ORDER EVALUATION PROCESS
FIGURE 16: DATA MINING SME E-COMPASS ARCHITECTURE
FIGURE 17: THE RDF REPOSITORY AND ITS RELATIONS WITH THE PROJECT WORK PACKAGES
FIGURE 18: TECHNIQUES APPLIED FOR RECOGNIZING RECURRING VISITORS (BAUER ET AL., 2011)
Table of Tables
TABLE 1: EUROPEAN B2C E-COMMERCE REVENUE OF GOODS AND SERVICES
TABLE 2: FUNCTIONALITY COMPARISON TABLE OF ANTI-FRAUD COMMERCIAL PRODUCTS
TABLE 3: LIST OF E-MARKETING TRENDS
TABLE 4: TRENDS & PRACTICES OF E-SALES VERSUS E-MARKETING TRENDS
TABLE 5: TRENDS AND PRACTICES VS. DATA MINING TECHNIQUES FOR E-SALES
TABLE 6: COMMERCIAL AND OPEN SOURCE E-SHOP SOFTWARE
TABLE 7: PRICE SEARCH ENGINES IN EUROPE
TABLE 8: DATA MINING SUITES
TABLE 9: OPEN SOURCE PRODUCTS IN PLACE
TABLE 10: TRENDS & PRACTICES VS. DATA MINING TECHNIQUES VS. DATA MINING SUITES
TABLE 11: WEB ANALYTICS METRICS BY THE WEB ANALYTICS ASSOCIATION
TABLE 12: WEB ANALYTICS METRICS BY IBI RESEARCH
TABLE 13: A CLASSIFICATION OF EMPIRICAL STUDIES EMPLOYING STATE-OF-THE-ART FRAUD DETECTION TECHNOLOGIES
Terms and abbreviations
AFDS            Advanced Fraud Detection Suite
AIS             Artificial immune systems
API             Application Programming Interface
AVS             Address Verification Service
BI              Business intelligence
BIN             Bank Identification Number
BSc             Business Scorecard
C2B             Consumer-to-Business
CCV             Card Code Verification
CI              Computational Intelligence
CNP             Card-not-present
COPL            Lower cut-off point
COPU            Upper cut-off point
CRISP-DM        Cross-Industry Standard Process for Data Mining
DB              Database
EAN             International Article Number
EC              European Commission
ECA             Event-condition-actions
ECC             SME E-COMPASS cockpit
EMT             e-marketing trends
EPS             Ebay-Powerseller
ES              Expert systems
ETL             Extract-transform-load
FD              Fraud detector
FDS             Fraud detection system
FP              Fraud prevention
GLT             Goods lost in transit
GTIN            Global Trade Item Number
IPP             Internet-Pure-Player
MCV             Multi-Channel-Vendors
MGV             Manufacturing Vendors
MMC             Merchant Category Code
NI              Nature-inspired
OCR             Over-the-counter retail
OPS             Online Pharmacies
OWL             Web Ontology Language
PSPs            Payment services providers
RDF             Resource Description Framework
RS              Risk score
SaaS            Software as a Service
SIC             Standard Industrial Classification
SM              Small and Medium
SME E-COMPASS   E-COMmerce Proficient Analytics in Security and Sales for SMEs
SME             Small Medium Enterprise
SVM             Support Vector Machines
TA              Transaction analytics
TAT             Transaction Analytics Toolkit
TSV             Teleshopping Vendor
URI             Uniform Resource Identifier
W3C             World Wide Web Consortium
WP              Work Package
Executive Summary
SME E-COMPASS Anti-Fraud Methodological Framework
What is nowadays called online or internet fraud is a constant plague for e-commerce, despite
the various efforts that have been made to develop new anti-fraud technologies and to
reinforce the legislative framework. This is mainly because fraudsters adapt quickly to
current defensive measures, constantly devising new tactics for breaching a security system.
Among the various types of fraud, those related to credit card payments are undoubtedly the
most frequently encountered and the most difficult to deal with. Credit card payment fraud
and other types of online fraud entail risks and losses for all links of the e-commerce chain:
online merchants, customers, and issuing and acquiring banks. In addition, they lead to
societal costs, as they threaten the very foundation of e-commerce: the customer’s trust in
the internet as a reliable and viable sales channel. Therefore, it becomes crucial for
e-commerce actors to design systems or processes that can either stop fraudulent activity in
the first place or detect it early, before its consequences escalate.
This is an essential step for European SMEs active in e-commerce in order to strengthen their
sustainability, increase their customers’ confidence in security and expand into new
cross-border markets in Europe. Reducing the need for manual review and increasing the
efficiency of the reviewing system are key for e-SMEs in growing online business profits and
managing the total cost of online payment fraud. It therefore pays off to invest in new
technologies that can detect malicious activity early, before its consequences become evident
to the online merchant.
Fraud detection systems (FDS) are nowadays quite popular in e-commerce; for instance, they
are used by more than half of the US and Canadian merchants doing business online. A typical
FDS receives information on the transaction parameters or the customer profile and produces
an indication of the riskiness of the particular order (a riskiness/suspiciousness score).
Based on this initial risk assessment, the order follows one of three alternative routes:
instant execution, automatic rejection or suspension for manual review. Modern FDS are
typically categorized in three groups: expert systems, supervised learning techniques and
anomaly detection methods. These vary in sophistication and also differ in the mechanisms
used to acquire and represent knowledge. A fourth group, which has recently appeared mostly
in the literature, is that of hybrid systems. These can be roughly defined as smart
combinations of possibly heterogeneous components that aim to deliver performance superior to
that of their building blocks. Hybridization is typically achieved along two different
routes: i. the aggregation of homogeneous entities and ii. the blending of heterogeneous
technologies. Additionally, the use of ontologies and ontology-related technologies for
building knowledge bases for rule-based systems is considered quite beneficial for an FDS.
Ontologies provide an excellent way of capturing and representing domain knowledge, mainly
due to their expressive power. Furthermore, a number of well-established methodologies,
languages and tools developed in the ontological engineering area can make the building of
the knowledge base easier, more accurate and more efficient.
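
The three-way routing of an order by its suspiciousness score can be illustrated with a minimal sketch. The score range [0, 1], the function name and the two cut-off values are assumptions made for illustration only; the report does not prescribe them.

```python
from enum import Enum


class Route(Enum):
    EXECUTE = "instant execution"
    REVIEW = "suspension for manual review"
    REJECT = "automatic rejection"


def route_order(risk_score: float,
                lower_cutoff: float = 0.3,
                upper_cutoff: float = 0.8) -> Route:
    """Map a suspiciousness score in [0, 1] to one of the three routes.

    The cut-off values are hypothetical defaults, not taken from the report.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must lie in [0, 1]")
    if lower_cutoff > upper_cutoff:
        raise ValueError("lower_cutoff must not exceed upper_cutoff")
    if risk_score < lower_cutoff:
        return Route.EXECUTE      # low risk: execute immediately
    if risk_score >= upper_cutoff:
        return Route.REJECT       # high risk: reject automatically
    return Route.REVIEW           # in between: hand over to a human reviewer
```

Widening the gap between the two cut-offs sends more orders to manual review, which is the cost driver that the report aims to reduce.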
In this report we try to expose the weaknesses and limitations of fraud detection technologies
and practices already in place. The discussion takes into account the special features of
the application domain and the business environment faced by small and medium (SM) online
merchants. The main weaknesses identified are, in brief: lack of adaptivity; lack of publicly
available data and joint actions; limitations in scalability and in the integration of
heterogeneous data and information sources; case imbalance and skewed class distributions;
difficulties in managing late- or false-labelled cases; cost-efficiency concerns; and lack of
transparency and interpretability.
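
Why skewed class distributions are a problem can be seen in a small worked example. The sketch below uses made-up numbers (10 fraudulent orders among 1000) and a hypothetical helper function; it shows that a detector which never flags anything still reaches 99% accuracy while catching no fraud at all.

```python
def detection_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is flagged or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative data: 10 fraudulent orders among 1000, detector flags nothing.
labels = [1] * 10 + [0] * 990
always_legitimate = [0] * 1000
metrics = detection_metrics(labels, always_legitimate)
```

For this reason, recall and precision on the fraud class are usually more informative than overall accuracy when fraud cases are rare.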
The nearly two decades of development for fraud monitoring systems have witnessed a
flourishing of different types of technologies with often promising results. In the early years,
fraud detection was accomplished with standard classification, clustering, data mining and
outlier detection models. Researchers soon realized the peculiarities of the problem domain
and introduced more advanced solutions, such as nature-inspired intelligent algorithms or
hybrid systems. The latter stream of research advocates the combination of multiple
technologies as a promising strategy for obtaining a desirable level of flexibility. First results
from the adoption of this practice to real-life e-commerce environments seem encouraging.
Still, how best to fine-tune a hybrid system presents a challenge to the designer, as it very
much depends on performance aspirations (cost-efficiency vs. prediction accuracy) and the
conditions of the operating environment.
Our methodological framework for an automatic fraud detector customized to European SME
needs follows the hybrid-architecture principle, in the spirit discussed above. For the
anti-fraud system-service that will be developed in the context of the project, the following
technologies are pre-defined and pre-selected:
1) an expert system with multiple rules-of-thumb for assessing the riskiness of each
transaction,
2) a variety of supervised learning models to be used for extracting patterns of fraudulent
activity from the transaction database (DB),
3) anomaly detectors, which are well suited for online fraud monitoring as they do not
typically rely on experts to provide signatures for all possible types of fraud. Among the
great range of candidate technologies, we particularly favour the application of hybrid
(semi-supervised) novelty detectors, combining statistical techniques with computational
intelligence models,
4) an inference engine to coordinate the risk assessment process and provide an aggregate
suspiciousness score through which each transaction can be classified in predefined
categories (normal, malicious, under review),
5) transaction analytics technologies that typically provide the fraud analyst with technical or
geographical information about each transaction and thus supplement in many ways
traditional background investigations on customer profiles.
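
How the components listed above might be combined into one aggregate suspiciousness score can be sketched as follows. This is an illustration under stated assumptions, not the project's design: the two rules of thumb, the z-score novelty detector, the weights and all names are hypothetical, and the supervised model's output is simply passed in as a number.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Transaction:
    amount: float
    billing_country: str
    ip_country: str
    is_new_customer: bool


def rule_score(tx: Transaction) -> float:
    """Expert-system component: fraction of (hypothetical) rules that fire."""
    fired = [
        tx.billing_country != tx.ip_country,        # location mismatch
        tx.is_new_customer and tx.amount > 500.0,   # large first order
    ]
    return sum(fired) / len(fired)


def anomaly_score(amount: float, past_amounts: list) -> float:
    """Statistical novelty detector: z-score of the amount, scaled to [0, 1]."""
    if len(past_amounts) < 2:
        return 0.0  # too little history to judge
    centre = mean(past_amounts)
    spread = pstdev(past_amounts)
    if spread == 0.0:
        return 0.0 if amount == centre else 1.0
    # Three standard deviations or more count as fully anomalous.
    return min(abs(amount - centre) / spread / 3.0, 1.0)


def aggregate_score(scores: dict, weights: dict) -> float:
    """Inference-engine step: weighted average of the component scores."""
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * s for name, s in scores.items()) / total_weight


tx = Transaction(amount=900.0, billing_country="DE", ip_country="BR",
                 is_new_customer=True)
scores = {
    "rules": rule_score(tx),
    "supervised": 0.5,  # placeholder for a trained classifier's output
    "anomaly": anomaly_score(tx.amount, [40.0, 50.0, 60.0]),
}
weights = {"rules": 0.4, "supervised": 0.4, "anomaly": 0.2}
overall = aggregate_score(scores, weights)
```

A weighted average is only one possible aggregation rule; it is used here because each component's contribution to the final score stays easy to read.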
As far as the scientific and technological objectives are concerned, these can be summarised as
follows:
• Extracting common fraudulent behaviours.
• Disseminating novel patterns of cybercriminal activity.
• Developing hybrid system architectures, experimenting with different levels of
hybridization. We particularly favour the use of nature-inspired intelligent algorithms
as standalone detectors or as part of a hybrid transaction-monitoring system.
• Improving the readability of the automated fraud detection process.
• Creating an adaptive fraud-detection framework.
• Improving the cost-efficiency of the overall fraud detection process.
• Exploiting cross-sectoral data and global information sources.
• Providing a software-as-a-service application.
SME E-COMPASS Data-Mining for e-Sales Methodological Framework
Every e-shop owner has to compete in a much broader regional or even national context than
traditional sellers of products in conventional stores. On the one hand, identical or at
least similar products are offered over the web, and potential customers can retrieve product
information and compare it with competitors’ offers within seconds and without great effort.
On the other hand, customer demand changes over time, sometimes very quickly. Thus, e-shop
owners need to identify those changes and react appropriately.
In order to position their own e-shop successfully in such a competitive environment, relevant
information about competitors and about their own (potential) customers is essential. E-shop
owners therefore need precise knowledge of their customers’ preferences in order to find out
whom (potential customers) to address, with what (products and services), how (marketing
channels and design of the e-shop) and when (time). The sales process thus requires deep data
analysis to understand the “consumer decision journey”.
When examining data mining for e-sales, the following issues become relevant: data gathering,
data extraction and analysis, automatized reaction to data analysis, and information
presentation/visualization. In order to monitor (potential) buyers, e.g. visitors and
customers of an e-shop, several web analytics tools have been developed. Web analytics
tools gather web usage data, analyze them and visualize them. Thus, web analytics can be
considered a part of data mining that adopts very similar technologies.
Two main widespread techniques exist for conducting web analytics: web server logfile analysis
and page tagging. Other methods and techniques, such as conversion paths (funnels), click path
analyses, clickmaps, heatmaps, motion players, attention maps, visibility maps and visitor
feedback, are additionally applied for specific purposes.
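
As an illustration of web server logfile analysis, the sketch below parses log lines and counts successful page views per path. It assumes the logs follow the widely used Combined Log Format; the sample lines and function names are invented for this example.

```python
import re
from collections import Counter

# One Combined Log Format entry: host, identity, user, [time],
# "request line", status, size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str):
    """Return the fields of one log line as a dict, or None if it is malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def page_views(lines) -> Counter:
    """Count successful GET requests per requested path."""
    views = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["method"] == "GET" and entry["status"] == "200":
            views[entry["path"]] += 1
    return views


sample_log = [
    '203.0.113.7 - - [10/Mar/2014:13:55:36 +0100] "GET /product/42 HTTP/1.1" '
    '200 2326 "http://www.example.com/search?q=red+shoes" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Mar/2014:13:56:02 +0100] "GET /product/42 HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Mar/2014:13:56:10 +0100] "GET /missing HTTP/1.1" '
    '404 512 "-" "Mozilla/5.0"',
]
views = page_views(sample_log)
```

Page tagging, the second technique, collects similar data in the visitor's browser instead of on the server and is not covered by this sketch.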
The three main types of data that are crucial for e-shop owners are data about:
1. where the customer came from before visiting the e-shop and, if a search
engine was the last step before the visit, which keywords were used for the search
2. the users’ behaviour onsite, e.g. usage statistics and real-time behaviour
3. competitor products, prices and their terms and conditions as well as their marketing
strategies and actions
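
The first kind of data in the list above, the visitor's origin and search keywords, can be derived from a referrer URL roughly as follows. The sketch assumes that the referring search engine exposes the query in a URL parameter named `q`; the parameter name, the function names and the example URL are assumptions, and some referrers carry no keywords at all.

```python
from typing import List
from urllib.parse import parse_qs, urlparse


def referrer_domain(referrer: str) -> str:
    """Return the host part of a referrer URL, e.g. 'www.example.com'."""
    return urlparse(referrer).netloc


def search_keywords(referrer: str, query_param: str = "q") -> List[str]:
    """Return the search keywords found in the referrer's query string."""
    values = parse_qs(urlparse(referrer).query).get(query_param, [])
    # parse_qs already decodes '+' and percent-escapes into plain text.
    return values[0].split() if values else []


origin = referrer_domain("http://www.example.com/search?q=red+shoes")
keywords = search_keywords("http://www.example.com/search?q=red+shoes")
```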
With the tools and methods of web analytics and data mining, information can be derived from
these data that allows e-shop owners to understand their customers and potential
customers better and to optimize their offering and marketing. Web analytics tools usually
analyze web site referrers in order to provide the first kind of data, which is used to
optimize marketing activities and marketing channels. The second kind of data provides insight
into user behaviour and into the potential for optimizing the owner’s web site or e-shop. The
challenges for e-shop owners, and therefore the state of the art that needs to be taken into
account, lie in the following areas: i. gathering the kinds of data from which valuable
information can be derived, ii. extracting valuable information from those data sets,
iii. analyzing this valuable information in such a way that appropriate actions can be taken
and iv. automatizing these actions.
The most commonly accepted definition of “data mining” is the discovery of “models” for data.
Data mining methods can be grouped into two main categories: prediction and knowledge
discovery. Prediction is the more ambitious goal; knowledge discovery is the weaker one
and usually precedes prediction. Prediction methods can be further divided into
classification and regression, while knowledge discovery comprises clustering,
association rule mining, and visualization. More recently, studies of customer opinions
and sentiment analysis have become very popular, since they provide
inferred information about new implicit tendencies of users. In addition, surveys and
taxonomies of web data mining applications have gathered and organized the existing
literature on this matter. More concretely, Market Basket analysis is one of the most
interesting subjects in e-commerce/e-sales, since it allows examining customer buying
patterns by identifying association rules among the various items that customers place in their
shopping baskets. New trends in web mining analysis are mainly focused on the use of big data
and cloud computing services, which make it possible to manage the large repositories of data
commonly generated by current e-commerce services and associated social networks. In this sense,
Grant Agreement 315637
PUBLIC
Page 13 of 144
SME E-COMPASS
D1.1 – SME E-COMPASS Methodological Framework– v.1.0
the analysis of customers’ behaviors and affinities across multiple linked sites for e-shopping, social
networks, e-marketing, security and online payment tools in digital ecosystems constitutes one
of the most promising research areas at present.
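As an illustration of the Market Basket analysis mentioned above, the following minimal sketch derives pairwise association rules from a handful of shopping baskets. The baskets, item names and the support and confidence thresholds are invented for illustration only, and the brute-force pair counting is a simplification rather than one of the algorithms selected by the project.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; item names are invented for illustration.
baskets = [
    {"camera", "memory card", "tripod"},
    {"camera", "memory card"},
    {"camera", "bag"},
    {"memory card", "bag"},
    {"camera", "memory card", "bag"},
]


def association_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return sorted (antecedent, consequent, support, confidence) tuples for item pairs."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each unordered pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Share of baskets containing lhs that also contain rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


for lhs, rhs, support, confidence in association_rules(baskets):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these invented baskets, only the camera and memory card pair passes both thresholds, in both directions. Real transaction logs would call for a scalable algorithm and for rules over larger item sets.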
Current web analytics solutions base their analyses on the data received in the
context of the e-shop. Interpreting and visualizing the numerous different types of data
is quite complicated and has to be done by the e-shop owners themselves unless
they are willing to pay for an advisor. Furthermore, as data become increasingly important,
small e-shop owners need to understand how to make use of big data, and they
need tools that suit them. The available web analytics tools only partially
meet the requirements of small e-shops. The complexity also becomes obvious when
examining how often e-shops analyze their web metrics. In order to attract more visitors to
an e-shop and to offer them personalized content depending on their needs, a
better understanding of the visitors is increasingly a key factor for
success. However, understanding the visitors means being able to analyze their
behavior in the e-shop. Small e-shop owners need to overcome the complexity of web
analytics and the hurdle of developing the know-how required to use it. In order to
understand the visitors’ behavior and take appropriate actions, the SME E-COMPASS project
should provide support and an easy-to-use tool that facilitates the usage of web
metrics, enriches existing web metrics with additional data sources in order to derive appropriate
actions, and appropriately visualizes the data and the actions, moving towards a decision support system.
The fundamental idea behind the SME E-COMPASS online data mining services is to support
small e-shops in increasing their conversion rates from visitor to customer by improving the:
• understanding of the customers and their expectations/motivation,
• knowledge about competitors and their activities, especially concerning their prices
and price trends,
• examination of potentials for improvements by analysing some selected information of
both, customers and competitors,
• initiation of appropriate actions depending on the identification of certain patterns in
the analysis results above-mentioned.
In order to implement a solution which supports the above-mentioned features, the following
technological objectives are set, each related to a dedicated module of the system:
1. Collection of data from various data sources and its consolidation. Our aim is to collect
relevant data from various internal and external data sources of an e-shop. Before
the data can be analysed, they need to be consolidated and made interpretable.
2. Collection of information on competitors and their products. For small e-shops, not only
the internal view of the e-shop (e.g. content and navigation structure) and its visitors
play an important role, but also external aspects. Therefore, SME E-COMPASS develops
mechanisms which enable e-shop owners to identify and collect relevant
information on competitors on the Web, such as product prices. Those mechanisms are
integrated in the SME E-COMPASS cockpit ECC and made available to the other
modules of the online data mining service.
3. Business Scorecard – optimization potential analysis. We aim to develop a target
group specific Business Scorecard which provides owners of small e-shops with new insights
into their activities and an overview of new optimization potentials by analysing the
internal and external data from various sources in addition to the existing web
analytics information.
4. Automated procedures by applying rule-based actions. Usually for owners of small
e-shops, the monitoring of all crucial internal and external metrics becomes complex.
In order to facilitate the monitoring process of relevant metrics and certain patterns, a
rule-based solution is designed and implemented which additionally allows defining
automated actions which are initiated when certain situations occur.
5. Visualization of the results in the E-COMPASS cockpit. In order to be able to configure
the services, e.g. which competitors need to be observed and which products are
relevant, and present the BI results of the different analyses, the SME E-COMPASS
cockpit is designed.
6. Software-as-a-service application. Similar to the anti-fraud use case, our vision of the
online data mining services is to create a web-based service which provides the
additional features, information and results to the owners of small e-shops.
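The rule-based automation described in objective 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's design: the `Rule` class, the metric names (`own_price`, `competitor_price`, `visits`, `orders`) and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Rule:
    """A monitored pattern together with the action it triggers (hypothetical structure)."""
    name: str
    condition: Callable[[Metrics], bool]  # the situation to watch for
    action: Callable[[Metrics], str]      # what to do when it occurs


# Hypothetical rules; metric names and thresholds are illustrative only.
RULES: List[Rule] = [
    Rule(
        name="competitor undercuts price",
        condition=lambda m: m["competitor_price"] < m["own_price"],
        action=lambda m: "notify owner: competitor is cheaper by "
                         f"{m['own_price'] - m['competitor_price']:.2f}",
    ),
    Rule(
        name="conversion rate drop",
        condition=lambda m: m["visits"] > 0 and m["orders"] / m["visits"] < 0.01,
        action=lambda m: "notify owner: conversion rate below 1%",
    ),
]


def evaluate(rules: List[Rule], metrics: Metrics) -> List[str]:
    """Run every rule against one snapshot of metrics and collect the triggered actions."""
    return [rule.action(metrics) for rule in rules if rule.condition(metrics)]


snapshot = {"own_price": 49.90, "competitor_price": 44.90, "visits": 2000, "orders": 30}
for message in evaluate(RULES, snapshot):
    print(message)
```

In this snapshot only the price rule fires, since the conversion rate of 1.5% stays above the assumed 1% threshold. Keeping conditions and actions as separate callables is one way to let shop owners combine them freely.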
SME E-COMPASS Integration Framework
The higher-level integration task in the project is to develop an RDF repository which integrates all
required data from data sources in different formats and makes them available to the services
developed in the project (anti-fraud and data mining for e-sales). This repository
uses RDF as its data model. Figure 0 depicts how the
repository is integrated within the two service applications.
Integrating data from multiple heterogeneous sources entails dealing with different data
models, schemas and query languages. An OWL ontology will be used as the mediated schema for
the explicit description of the data source semantics, providing a shared vocabulary for the
specification of the semantics. In order to implement the integration, the meaning of the
source schemas has to be understood. Therefore, we will define mappings between the
developed ontology and the source schemas.
In the case of the online fraud application, the aim of the RDF repository is to make data from
data sources in different formats available to the anti-fraud algorithms. Data translators from RDF
to other formats will be developed when necessary, enabling the interchange of data among
algorithms dealing with different data models. The results of the algorithms will also be stored in
the RDF repository to make them available to the other algorithms.
In the case of data mining for e-sales, the RDF repository stores data about online transactions and
user registries in order to produce integrated data. These integrated RDF data will be translated to a
format that data mining tools can understand to enable the analysis of the data.
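The idea of mapping heterogeneous source schemas onto a mediated vocabulary and storing the result as triples can be sketched as below. This is a simplified illustration only: plain Python tuples stand in for an RDF store, the namespace `http://example.org/ecompass#`, the source names and all field names are hypothetical, and literals are kept as untyped strings.

```python
# Hypothetical namespace standing in for the terms of the OWL ontology.
EC = "http://example.org/ecompass#"

# Hypothetical mappings from each source schema to the shared vocabulary.
MAPPINGS = {
    "shop_db": {"order_id": "orderId", "total": "amount", "cust_email": "customerEmail"},
    "payment_csv": {"OrderRef": "orderId", "Amount": "amount", "CardCountry": "cardCountry"},
}


def to_triples(source, record, id_field):
    """Translate one source record into (subject, predicate, object) triples."""
    mapping = MAPPINGS[source]
    subject = f"{EC}order/{record[id_field]}"
    return {
        (subject, EC + mapping[field], str(value))
        for field, value in record.items()
        if field in mapping  # unmapped fields are ignored
    }


def describe(repo, subject):
    """Collect all properties stored for one subject, with the namespace stripped."""
    return {p[len(EC):]: o for s, p, o in repo if s == subject}


# The repository is a set, so identical triples from different sources merge.
repository = set()
repository |= to_triples(
    "shop_db",
    {"order_id": 1001, "total": 59.9, "cust_email": "a@example.org"},
    "order_id",
)
repository |= to_triples(
    "payment_csv",
    {"OrderRef": "1001", "Amount": "59.9", "CardCountry": "DE"},
    "OrderRef",
)

print(describe(repository, EC + "order/1001"))
```

Both records describe the same order, so after translation they share one subject and the facts from the two sources appear side by side. An actual implementation would rely on an RDF library, typed literals and the ontology itself rather than on a hand-written dictionary.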
Figure 0: The RDF repository and its relations with the Project Work Packages
1 Introduction
The main objective of Work Package One (WP1) “SME E-COMPASS Framework” is to
design and deliver the project’s methodological framework and the necessary documentation
describing its impact on online fraud management and real-time data mining for SMEs. The
project team reviewed international scientific evidence as well as current best practices in the
e-business sector with the aim to identify and analyse opportunities and limitations of online
anti-fraud and data mining tools. A critical objective of this work-package is to obtain a deeper
understanding of the requirements and challenges faced by online SME merchants. The
outcomes of this analysis will become a basis for designing an evaluation framework for the
project’s applications and services as well as for their design and technological development.
1.1 About this deliverable
This report summarizes in a comprehensive manner current approaches to fraud
prevention and data mining tools, both as reported in the scientific literature and as used in
e-commerce practice globally. It also illustrates the required models and theories that will be
implemented in the project’s applications and services for the benefit of SME Associations and
their members.
Furthermore, the SME E-COMPASS Methodological Framework report reflects the work
accomplished in the first three months of the project in two separate tasks, namely:
Task 1.1: Models and Theories of Real-time Anti Fraud Systems
Under this task the technological partners of the consortium reviewed the academic literature
and available tools addressing the issue of online fraud detection. The research team
thoroughly presents best practices, weaknesses and limitations of current approaches in a
wide spectrum of technologies, such as expert rule-based systems, computational intelligence
models and hybrid architectures. In this task, the models and theories to be designed
and developed were defined and analysed, and a pre-selection of them was justified with
reference to specific scientific and technological objectives.
Task 1.2: Models and Theories for Real-time Data Mining as a Service
Real-time Data Mining as a Service aims at fostering e-sales operations by data analysis and
event processing. In this task, the required analysis of definitions, trends, current best
practices and data-mining techniques in the e-business sector was conducted. Furthermore,
the research team identified the current challenges posed to online SME merchants as well as
the integration opportunities that will be given to European SMEs with the usage of the
project’s data mining tool. The initial methodology design for real-time data mining as a
service was scheduled and justified.
1.2 Document structure
This report is structured around the work performed in the
aforementioned tasks. Sections 3 and 4 are each dedicated exclusively to one
application: the former covers the anti-fraud system and the latter the data-mining tool for
e-sales. In more detail, each section covers the following topics and discussions:
Section 2 “Definitions”, introduces online fraud in e-commerce and provides definitions, the
size of the problem for e-merchants as well as statistics that highlight the obstacles for
cross-border e-commerce as well as for the sustainability of small and medium e-shops. It furthermore
discusses the basics of data mining and web analytics for e-sales operations by providing insights
and recent statistics from the global and European dimensions. The section concludes with a
description of the semantic web, ontologies and their languages, and of the opportunities arising
from their implementation within the project.
Section 3, “Analysis of online anti-fraud systems” presents the current trends and practices of
Fraud Detection Systems, analysing manual order review, data usage and state-of-the-art
technologies such as expert systems, supervised learning techniques, anomaly
detection technologies, hybrid architectures and semantic web technologies. The next
subsection is dedicated to the commercial products in place, briefly describing a dozen of them in a
comparative manner. Then, completed EC research projects relevant to the fraud detection
topic are documented. The section concludes with the weaknesses and limitations of current
practices linked to current SME needs.
Section 4, “Analysis of data mining for e-sales” begins with the presentation of the
state-of-the-art technologies and continues with the current trends and practices for e-sales and data
mining techniques. It additionally describes the commercial products in place such as web
analytics, data mining suites and tools for price search. The next sub-section focuses on recent
research project results as well as a review of the scientific literature on the domain. The section
finishes with the weaknesses and limitations of current practices compared to SME needs.
Section 5, “From Knowledge Harvesting to Designing SME E-COMPASS Methodological
Framework” is the final section of the report. Apart from recapitulating the principal
conclusions of the previous sections, this section defines the methodological framework of the
project in terms of technologies pre-selection, schematic depiction of the applications’
architecture and technological objectives. The section and the report’s main body finish with
the presentation and explanation of the project’s integration framework among RTD work
packages, data repository and overall applications.
The 6th Section of the report is an Appendix which presents web analytics techniques for
visitors’ behaviour analysis, metrics for customer behaviour analysis and a table classifying
empirical studies employing state-of-the-art fraud detection technologies. The references
section concludes the document.
2 Definitions
This section defines the major business and technological terms that the project deals with
and presents insights and statistics that argue for the need for advanced technological support for
European SMEs active in e-commerce. Online fraud is discussed first, followed by data
mining techniques applied in e-commerce operations for merchants as well as the contribution of the
semantic web to supporting data analysis in the e-commerce industry.
2.1 Online Fraud
According to a mainstream definition, fraud is the “wrongful or criminal deception intended to
result in financial or personal gain”1. The advent of internet technology and the popularization
of online sales have resulted in an increase of fraudulent activity, often leading to the outbreak
of new types of criminal behaviors over the web. What is nowadays called online or internet
fraud is a constant plague for e-commerce, despite the various efforts that have been made in
the directions of developing new anti-fraud technologies and reinforcing the legislative
framework. This is mainly because fraudsters are highly adaptive to current defensive
measures, constantly devising new tactics for breaching a security system.
Nowadays, there are various ways in which malicious behaviors manifest themselves over the
web, so that it becomes difficult to come up with a comprehensive taxonomy. Among the most
popular and well-documented types of internet fraud, with particular relevance to
e-commerce, are account takeover, phishing, pagejacking and credit-card-related frauds2.
Account takeover occurs, for example, when a fraudster gains access to an e-shop customer’s
account, by obtaining credentials and other personal information from the legitimate holder.
He/she can then alter the configurations of the existing account (e.g. add new users or change
the postal address) and perform unauthorized transactions pretending to be the authentic
customer. Phishing is a type of fraud by which victims are prompted by fake emails, phone
calls, text messages or redirects to fraudulent web sites to disclose personal information
through which criminals can make profit. Pieces of information that are typically “fished” out
are username, password, identity card details, credit card details, PIN codes, etc. Through
pagejacking a hacker can create a malicious “clone” of an e-shop’s web site and try to “steal”
customers from the original shop (e.g. through redirection of search engines) to their own
detriment3. Fraud related to payments by credit card is perhaps the most common concern of
both merchants and customers and will be analyzed in a separate paragraph below.
1 See www.oxforddictionaries.com.
2 See e.g. http://www.actionfraud.police.uk/a-z_of_fraud for an “a-to-z” discussion of various types of
internet fraud.
3 See e.g. http://www.marketingterms.com/dictionary/.
There also exist other types of fraud, such as cash-on-delivery and lost/stolen goods, which,
although not conducted over the web, also cause trouble to online retailers.
Cash-on-delivery is a popular practice by which payments are made when goods are received by the
buyer. No matter how safe it may seem, it is the source of many risks for both the retailer and
the customer. A buyer, for instance, may be prompted to pay for a good that is defective or in
bad condition. Conversely, a seller may receive and execute an order for a product which the
customer cannot afford and hence is unable to pay for at the time of delivery4. The
lost/stolen-goods fraud, or “goods lost in transit” (GLT) scam as others call it5, arises when
the customer claims that the ordered goods have never been delivered, whereas in fact they
have, or falsely claims that they have been compromised by a third party. In either case, the person who
raises this claim seeks some compensation from the merchant or aims to make money
by selling the supposedly stolen good. Similarly, GLT fraud can be committed by dishonest
sellers claiming that they have never received a returned good.
Among the various types of fraud analyzed above, those related to credit card payments are
undoubtedly the most frequently encountered and the most difficult to deal with. Traditionally, credit
card frauds have involved illegal usage of the physical card6. This could occur, for instance, by
physically compromising someone else’s card (card theft), asking for a duplicate copy
(duplicate fraud), applying for a new card to be issued in someone else’s name (identity
theft), using the card while being unable to repay the amount of the purchased goods
(bankruptcy fraud), or falsely alleging that a newly issued card has never been received (never
received issue). With the increasing popularity of telephone/online sales, it has also become
possible to commit fraud without possessing the plastic card; one
simply needs to know its details. This type of transaction, broadly termed card-not-present
(CNP), is nowadays considered to be one of the main fraud channels, especially in e-commerce
(Bolton and Hand, 2002).
A typical online shopping system allows the customer to select a basket of products/services
and then asks for credit card details to execute the order. All payment-related data are routed
to the merchant’s acquiring bank which is responsible for settling the transaction. However,
even if the order has been cleared, the customer has the right to reverse the transaction
on the grounds that the item/service received does not meet the initial standards, or by claiming theft of
his/her card details. This is a typical case of a chargeback process initiated by the issuing bank
for the compensation of the card holder. If the money-back claim gets through, the merchant
incurs losses as the service/good has already been provided but its monetary value has to be
refunded7. But, even if the retail company manages to win the case, it still has to shoulder the
4 http://www.ehow.com/how_2337528_avoid-falling-victim-cash-delivery.html
5 http://www.transactis.co.uk/blog/viewpoints/goods-lost-in-transit-glit-fraud-a-new-retail-threat-for-anew-technological-age/
6 See Bolton and Hand (2002), Delamaire et al. (2009), Sahin and Duman (2011) and Pavía et al. (2012)
for a taxonomy and discussion of various types of credit-card-related fraud.
7 See Lei and Ghorbani (2012), http://usa.visa.com/merchants/merchant-support/dispute-resolution/chargeback-cycle.jsp or https://www.unibulmerchantservices.com/chargeback-management/ for
illustrative representations of the chargeback process and the parties involved therein. In practice, the
actual financial consequences of a chargeback depend on when the claim is received. If the claim is
received fast enough, the merchant may manage not to bear any loss at all. However, the claim
costs of processing and defending the claim8. A chargeback is typically regarded as a customer
protective measure, but it can also be the source of abuse from deceitful card holders. This
happens e.g. when the customer purchases goods/services using a valid credit card and then
disputes the charge claiming that his card has been used in an unauthorized manner.
Credit-card payment fraud and other types of online fraud entail risks and losses for all “links” of the
e-commerce chain: online merchants, customers, issuing and acquiring banks. In addition,
they lead to societal costs, as they threaten the very foundation of e-commerce: the
customer’s faith in the internet as a reliable and viable sales channel.
According to the EC’s “Flash Eurobarometer Survey9” on consumers’ attitudes towards
cross-border trade, more than half (53%) of European consumers made at least one online
purchase in the twelve months preceding September 2012. This proportion has almost
doubled since 2006. Furthermore, the same survey reveals a fast uptake of e-commerce in all
27 Member States, with the strongest development observed in Slovakia, Ireland, Poland, the
Czech Republic and Cyprus. The Internet is used to make purchases mainly from sellers or
providers based in the respondent's own country. The proportion of respondents who make
purchases from domestic vendors has grown from 23% in 2006 to 47% in 2012. More than
half of Europeans (59%) feel confident about buying something online from a domestic
vendor, whereas only 36% feel confident about
purchasing via the Internet from a vendor located in another EU country.
A major reason for this inward-looking commercial attitude is online fraud. As the 2013 LexisNexis®
True Cost of Fraud(SM) Study reveals, almost one in three consumers whose
identity has been stolen seek no further dealings with the online store through which they
had this unpleasant experience10. Furthermore, another recent study from Aite Group LLC
in conjunction with ACI payment systems11 highlights that after experiencing online fraud, 61%
of cardholders chose to use cash or an alternate form of payment instead of their card,
regardless of how satisfied they were with their card provider after the fraud
experience; curtailed card use is for many a lingering impact of the fraud experience. This
ACI Worldwide study of 5,223 consumers in 17 countries provides an
overview of respondents’ attitudes toward various types of financial fraud and discusses the
7 (continued) notification usually arrives extremely late, in which case the merchant has to bear the cost for the whole
value of the service/product - a direct hit to the merchant’s bottom line.
8 According to the CyberSource’s “2011 Online Fraud Report” (http://www.cybersource.com/current_resources), US and Canadian merchants are typically successful in only 41% of the chargeback
cases they contest.
9 Flash Eurobarometer 358 “Consumer attitudes towards cross-border trade and consumer protection”
accessed via http://ec.europa.eu/public_opinion/flash/fl_358_en.pdf , survey conducted by TNS
Political & Social at the request of the European Commission, Directorate-General for Health and
Consumers, published June 2013
10 Available from http://www.lexisnexis.com/risk/downloads/assets/true-cost-fraud-2013.pdf.
11 Aite Group LLC in conjunction with ACI payment systems report “Global Consumers React to Rising
Fraud: Beware Back of Wallet”, available from:
http://www.aciworldwide.com/~/media/files/collateral/aci_aite_global_consumers_react_to_rising_fraud_1012?utm_campaign=&utm_medium=email&utm_source=Eloqua , published
October 2012
actions they may take subsequent to a fraud experience. In total, 5,223 consumers were
included in the research: approximately 300 consumers, divided equally between men and
women, participated in each of the 17 countries. After experiencing card fraud, some
cardholders tend to use cash or an alternate payment method instead of the type of card with
which they experienced the fraud incident. Understanding how cardholders change their behavior
after experiencing fraud is important because of the implications for the issuer’s card
profitability. Changes in behavior are common in Italy and in Germany, where 50% of customers
used their cards less often after the fraud experience. Lower but still significant percentages
appear in other Western European countries: 39% in the United Kingdom, 38% in Sweden and
17% in the Netherlands.
Up to this point, a brief overview of official statistics has been presented, analyzing consumers’
reactions to fraud and its implications for credit card issuers and e-commerce retailers.
For the latter, the following dedicated surveys pinpoint the dimensions of the impact of online
fraud on their daily operations. Another recent EC Flash Eurobarometer survey12 on “Retailers’
attitudes towards cross-border trade and consumer protection” states that e-merchants in Europe
are likely to sell only to consumers in their own country. One quarter of retailers
(25%) sell cross-border to consumers, and there has been a slight decrease in this proportion
since 2011 (-2 percentage points). Higher costs of fraud and non-payment (41%) as well as
costs of compliance with different consumer protection rules and contract law (41%) are the
most frequently mentioned obstacles to the development of cross-border electronic trade. Mentions of these
obstacles have increased since 2011 by 9% and 7% respectively. Retailers who already trade
with at least one other EU country rank the potentially higher costs of the risk of fraud or
non-payment as the most important obstacle (51%). Overall, 24% of retailers plan to sell
cross-border in the next 12 months. However, 9% of retailers that are currently selling cross-border
do not plan to continue in the next 12 months.
As far as the financial consequences of online fraud in e-commerce are concerned,
the annual surveys of CyberSource (a Visa company) measure fraud in North America and in the UK.
The latest (2013) CyberSource fraud report13 for the United States and Canada reveals that for
2012 companies reported an average loss of 0.9% of total online revenue due to fraud, similar
to 2010 levels. Using 2012 industry market projections of e-commerce sales
in North America, the same report estimates the total revenue loss at approximately 3.5 billion US-dollar,
with the average fraud rate by revenue estimated at 0.9%. Of this figure, 43% are chargebacks
and 57% are credits issued directly to consumers by companies. Findings of the CyberSource
12 Flash Eurobarometer 359 “Retailers’ attitudes towards cross-border trade and consumer
protection” accessed via http://ec.europa.eu/public_opinion/flash/fl_359_en.pdf, survey conducted by
TNS Political & Social at the request of the European Commission, Directorate-General for Health and
Consumers, published June 2013
13 CyberSource Corporation, a Visa company, “Online Fraud Report: Online Payment Fraud
Trends, Merchant Practices, and Benchmarks”, available from
http://www.cybersource.com/current_resources/, published in 2013
2013 report on the UK14 show that 1.65% of e-commerce revenues are lost due to fraud and that 4%
of overall orders are rejected on suspicion of fraud.
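The North American figures quoted above can be broken down with simple arithmetic. The sketch below uses only the numbers cited from the CyberSource report. The implied revenue base is a back-of-the-envelope derivation (loss divided by fraud rate) and is not a figure reported in the source.

```python
# Figures quoted in the text (CyberSource 2013 report, US and Canada, 2012).
total_fraud_loss_bn = 3.5   # total revenue loss, in billion US-dollar
fraud_rate = 0.009          # 0.9% of total online revenue
chargeback_share = 0.43     # share of the loss due to chargebacks
credit_share = 0.57         # share of the loss due to credits issued to consumers

chargebacks_bn = total_fraud_loss_bn * chargeback_share
credits_bn = total_fraud_loss_bn * credit_share

# Derived here, not reported in the source: the sales base implied by loss / rate.
implied_revenue_bn = total_fraud_loss_bn / fraud_rate

print(f"chargebacks: {chargebacks_bn:.3f} bn USD")
print(f"credits:     {credits_bn:.3f} bn USD")
print(f"implied online revenue base: about {implied_revenue_bn:.0f} bn USD")
```

Under these figures, chargebacks account for roughly 1.5 billion US-dollar and direct credits for roughly 2.0 billion US-dollar of the estimated loss.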
On the grounds of the evidence provided above, it becomes crucial for e-commerce actors to
design systems or processes that could either stop fraudulent activity in the first place or be
able to detect it early before its consequences escalate.
2.2 Data Mining and web Analytics for e-Sales Operations
When considering data mining techniques and services for e-sales, it is important to
understand the development of e-commerce activities in the different regions within
Europe and the constantly increasing competition among sellers in and across those
markets. In the past ten to fifteen years, the e-commerce activities of traditional
companies have often grown to a considerable share of total revenue and developed into an
important sales channel for retaining IT-affine customers who like to buy electronically over the
web. Alongside traditional businesses, the web as a sales channel has allowed new companies (i.e.
online start-ups) and virtual organisations (i.e. business networks) to emerge, which have established
new business models and thus new ways of selling products and services online (Weill & Vitale,
2013).
When competing in those non-transparent markets, companies need to work hard to be
visible to their customers. Therefore, constant optimization of current processes and
activities, e.g. marketing campaigns, the order process, and the shipping process, is essential, as is
identifying potentials to address new customers and retain existing ones (Cavalcante,
Kesting, & Ulhøi, 2011).
In this context, it is important for a company to examine its own strengths and weaknesses from a customer
point of view. A close look at the competitors and a good understanding of
potential and existing customers help to position the business successfully in the target
markets (Wilson & Gilligan, 2005). Constant monitoring of the activities of competitors,
visitors and customers is required in order to maintain and improve the company’s own business
activities. Below are some key figures on markets in general, e-commerce players and
buyers which are relevant when designing and developing a data mining service for e-sales.
Importance of e-commerce increases worldwide and in Europe
The importance of the World Wide Web as a sales channel with e-payments has been
constantly increasing over the past years. The largest segment of e-payments is
consumer-to-business (C2B) payments, which are used mainly for goods purchased in online stores and
are being driven by the fast growing global e-commerce market. The market is expected to
14 CyberSource Corporation, a Visa company, “2013 UK eCommerce Fraud Report”, available
from http://www.cybersource.com/current_resources/, published in 2013
grow15 by 18.1% per year from 2010 (when transactions numbered 17.9 billion, see Figure 1)
until 2014 (with an estimated total of 34.8 billion transactions and a value of 1,792.4 billion US-dollar). This
growth could be compromised by concerns about online fraud and the high dropout rates of
consumers buying online. Dropout rates of up to 60%10 among online buyers could be
reduced with the development of more convenient payment methods by payment services
providers (PSPs) as well as with advanced anti-fraud systems placed between the
e-merchant and the customer.
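The relation between the quoted annual growth rate and the two transaction totals can be verified with the standard compound growth formula. The following minimal sketch uses only the figures cited above; the function names are chosen here for illustration.

```python
def compound(start, rate, years):
    """Value after growing `start` by `rate` per year for `years` years."""
    return start * (1 + rate) ** years


def cagr(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# Figures quoted in the text: 17.9 billion transactions in 2010,
# an estimated 34.8 billion in 2014, and 18.1% growth per year.
projected_2014 = compound(17.9, 0.181, 4)
implied_rate = cagr(17.9, 34.8, 4)

print(f"projected 2014 transactions: {projected_2014:.1f} billion")
print(f"implied annual growth rate: {implied_rate:.1%}")
```

Growing 17.9 billion transactions at 18.1% per year for four years gives approximately 34.8 billion, so the three quoted figures are consistent with one another.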
[Figure 1 chart data, in billion transactions: 2010: 17.9; 2011: 21.3; 2012: 25.4; 2013*: 29.9; 2014*: 34.8]
E-commerce figures include retail sales, travel sales, digital downloads purchased via any digital
channel and sales from businesses that occur over primarily C2C platforms such as eBay. Chart
numbers and quoted percentages may not add up due to rounding.
Source: Capgemini Analysis, 2013; http://www.emarketer.com/Article/Ecommerce-Sales-Topped-1Trillion-First Time-2012/1009649 ; “Edgar Dunn advanced payments”, 2011;
http://www.finextra.com/News/FullStory.aspx?newsitemid=24499;
http://mashable.com/2011/02/28/forrester-e-commerce/ ; http://econsultancy.com/in/blog/61696-
Figure 1: Number of Global E-Commerce Transactions (Billion), 2010–2014F
In 2012, the B2C e-commerce revenue worldwide amounted to 1,043 billion US-dollar, with
an expected 78 percent increase up to the year 2016 (see Figure 2).
15 According to the World Payments Report of 2013 drafted by CapGemini and Royal Bank of Scotland,
accessible at http://www.capgemini.com/resource-file-access/resource/pdf/wpr_2013.pdf
Revenue (billion US-dollar): 2011: 856.97; 2012: 1,042.98; 2013*: 1,221.29; 2014*: 1,444.97;
2015*: 1,654.88; 2016*: 1,859.75 (* forecast)
The figures include sales of travel bookings, digital downloads and event tickets; online games are not included.
Figure 2: B2C e-commerce revenue worldwide in 2011 and 2012 and the forecasts until 2016 (in billion
US-dollar) (eMarketer, 2013a)
In comparison, the B2C e-commerce revenue in Europe in 2012 came to 256 billion US-dollar,
with an expected 51 percent increase up to the year 2016 (see Figure 3). The figures show that B2C e-commerce
activities within the European market will develop significantly. However, European B2C e-commerce is evolving much more slowly than in other regions of the world.
Revenue (billion US-dollar): 2011: 218.27; 2012: 255.59; 2013*: 291.47; 2014*: 326.13;
2015*: 358.31; 2016*: 387.94 (* forecast)
The figures include sales of travel bookings, digital downloads and event tickets; online games are not included.
Figure 3: B2C e-commerce revenue in Europe in 2011 and 2012 and forecasts until 2016 (in billion US-dollar) (eMarketer, 2013b)
The main growth region in terms of B2C e-commerce is Asia-Pacific, for which experts forecast
growth of 124 percent from 2012 to 2016. The large US market is expected to grow by 55
percent, very similar to the European market. However, Western Europe will only reach in 2016
the B2C e-commerce revenue that the US had already achieved in 2012, i.e. four years later (see Figure 4).
According to Ecommerce Europe (www.ecommerce-europe.eu), European B2C
e-commerce, including online retail goods and services such as online travel bookings, events
and other tickets, downloads etc., grew by 19.0 percent in 2012 to reach 311 billion Euro (which is
equivalent to 426 billion US-dollar at an exchange rate of 1.37, as of 26.2.2014)
(Weening, 2013).
Region                        2012      2016
US                            373.03    580.24
Asia-Pacific                  315.91    707.60
Western Europe                255.59    387.94
Central and Eastern Europe    40.17     68.88
Latin America                 37.66     69.60
Middle East and Africa        20.61     45.49
The figures include sales of travel bookings, digital downloads and event tickets; online games are not included.
Figure 4: B2C e-commerce revenue depending on certain regions of the world in 2012 and forecasts
until 2016 (in billion US-dollar) (eMarketer, 2013a)
The largest B2C e-commerce markets are the UK with 96 billion Euro, Germany with 50 billion
Euro, France with 45 billion Euro, and Spain with 13 billion Euro (Weening, 2013).
European Region      2012     Growth
West                 160.8    15.8%
Central              76.3     20.5%
South                32.4     29.3%
North                28.7     15.1%
East                 13.4     32.6%
Total Europe (47)    311.6    18.8%
Total EU (28)        276.5    18.1%
Table 1: European B2C e-commerce revenue of goods and services (in billion Euro and percentage of
growth) (Weening, 2013)
Number of online buyers increases
A closer look at Germany, the second largest market in Europe, shows that the share of
online buyers in the whole population comes to 73 percent, which means that over two thirds
of the population purchase online.
Share of online buyers (percent): 2000: 9.7; 2001: 25.3; 2002: 30.2; 2003: 40.9; 2004: 45.1;
2005: 49.6; 2006: 54.1; 2007: 58.8; 2008: 63.3; 2009: 62.0; 2010: 63.8; 2011: 65.5; 2012: 70.8;
2013: 72.8
Figure 5: Share of online buyers of the whole population in Germany from 2000 to 2013 (Institut für
Demoskopie Allensbach, 2013)
In the youngest age groups, about one third of all purchases are made over the web. This indicates great
potential for increasing B2C e-commerce, especially when those generations
grow older and maintain their purchasing behaviour.
Share of online purchases (percent): 16 - 24 y.o.: 32; 25 - 34 y.o.: 32; 35 - 44 y.o.: 29;
45 - 54 y.o.: 26; 55 - 64 y.o.: 22; over 65 y.o.: 17
Figure 6: Share of online purchases in comparison to the overall purchases per age group in Germany in
2012 (Bundesverband Digitale Wirtschaft (BVDW) e.V., 2012)
Top product groups of e-commerce are textile, clothing and consumer electronics
When examining the products which are sold over the web in Germany, the two largest groups
are textile and clothing products and consumer electronics/e-articles, followed by computer
and accessories, books and hobby, collection and leisure articles.
The product type influences the customers, the competition and thus how products are sold
over the web. For example, products with well-known brand names can be quickly
found at different retailers and wholesalers on the web, and customers make their buying
decision by comparing prices. Differentiation from competitors is then only possible by
offering the lowest price and/or additional services. Such services may be related to the
product (e.g. maintenance, hotline), the transaction (e.g. terms & conditions) or the e-shop (e.g.
additional functions which support customers in their decision-making process) (Mikians,
Gyarmati, Erramilli, & Laoutaris, 2012).
When retailers sell products which are not branded, customers first need to match
similar products and check whether their features fulfil the main requirements
before they can compare them, in a second step, on the above-mentioned criteria of
price and services. In this case, the comparison of products is much more
difficult and requires more effort from potential customers; on the other hand, sellers have
more possibilities to differentiate their products (Aanen, Nederstigt, Vandić, & Frasincar,
2012; Nah, Hong, Chen, & Lee, 2010).
Revenue in million Euro (2012 / 2011):
textile and clothing products: 5,960 / 4,600
consumer electronics/e-articles: 3,540 / 2,570
computer and accessories: 2,280 / 2,060
books: 2,190 / 1,970
hobby, collection and leisure articles: 1,980 / 1,480
shoes: 1,270 / 1,110
furniture and decorative goods: 1,230 / 780
household appliances: 990 / 720
telecommunication, mobile phones and accessories: 970 / 500
DIY/garden/flowers: 960 / 740
video or audio recordings: 910 / 790
car and motorcycle/accessories: 810 / 740
Figure 7: Top 20 product groups in e-commerce depending on revenue in Germany in 2012 (in million
Euro) (bvh, 2013b)
Only a few key players in the markets
Visitors (million): Amazon: 22.73; Ebay: 21.44; Otto: 5.48; Tchibo: 3.76; Zalando: 3.73;
Bon Prix: 3.25; Lidl: 3.10; Media-Markt: 2.50; Ikea: 2.46; Weltbild: 2.33
Figure 8: Visitor numbers of the largest e-shops in Germany in June 2013 (in million)
(lebensmittelzeitung.net, 2013)
Figure 8 presents the visitor numbers of the 10 largest e-shops in
Germany in June 2013. Obviously, there are two main players in the market, Amazon and
Ebay, which dominate the market, whereas each of the following 8 e-shops achieves at most
about a quarter of the visitors of either of the TOP2.
Revenue share (percent): Top 10: 32.3; Top 100: 62.7; Top 500: 85.5
Figure 9: Revenue share of the TOP10, TOP100 and TOP500 e-shops of the whole market in Germany in
2012 (EHI Retail Institute, Statista, 2013)
When comparing revenue, the TOP10 e-shops in Germany account for about a third of the overall
revenue of the market. The TOP500 e-shops cover 85 percent of the market, which corresponds to
42.5 billion Euro based on the overall market revenue of 50 billion Euro mentioned above. The remaining 7.5
billion Euro of revenue is shared by approx. 150,000 other e-shops in Germany (EHI Retail
Institute, Statista, 2013).
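The split of the market revenue described above can be written out as a short calculation. This is a sketch only: the average per remaining e-shop is a derived figure that does not appear in the cited sources, and as a plain average it says nothing about how the revenue is actually distributed among those shops.

```python
total_market = 50.0e9    # overall market revenue in Euro, as quoted in the text
top500_share = 0.85      # rounded share used in the text (Figure 9 shows 85.5 percent)
other_shops = 150_000    # approximate number of remaining e-shops

top500_revenue = total_market * top500_share
remaining_revenue = total_market - top500_revenue

# Derived, illustrative figure: mean revenue per e-shop outside the TOP500.
average_per_other_shop = remaining_revenue / other_shops

print(f"TOP500 revenue:    {top500_revenue / 1e9:.1f} billion Euro")
print(f"remaining revenue: {remaining_revenue / 1e9:.1f} billion Euro")
print(f"average per remaining e-shop: {average_per_other_shop:,.0f} Euro")
```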
The Federal Association of the German Retail Trade differentiates between 8 types of vendors
(bvh, 2013a):
• Multi-Channel-Vendors (MCV)
• Internet-Pure-Player (IPP)
• Vendors who have their origin in over-the-counter retail (OCR)
• Ebay-Powerseller (EPS)
• Teleshopping Vendor (TSV)
• Manufacturing Vendors (MGV)
• Online Pharmacies (OPS)
• New Multi-Channel-Vendors (MCVnew=MCV+OPS+OCR+TSV)
The Internet-Pure-Players (IPP) increased their revenue by 40 percent between 2011 and
2012, and the vendors who have their origin in over-the-counter retail (OCR) by 22 percent.
The other types of vendors increased their revenue by between 2 and 13 percent. IPP and OCR
vendors are both considered rather small enterprises, which usually do not
have many resources and/or competences in all of the relevant areas of e-commerce,
i.e. retailing and information technologies.
Data and data mining in e-sales
Every e-shop owner needs to compete in a much broader regional or even national context than
in the traditional sale of products through conventional stores. On the one hand,
identical or at least similar products are offered over the web, and potential customers can
retrieve the product information and compare it with the offers of competitors within
seconds and without great effort. On the other hand, customers’ demand changes over
time, sometimes very fast. Thus, e-shop owners need to identify those changes and
react appropriately.
In order to successfully position one's own e-shop in such a competitive environment, relevant
information about the competitors and one's own (potential) customers is essential. Owners of
e-shops must therefore gather precise knowledge of the customers’ preferences to find out
to whom (potential customers), what (products and services), how
(marketing channels and design of the e-shops) and when (time) to address the target groups.
The sales process thus requires a deep data analysis to understand the “consumer decision
journey” (Carmona et al., 2012).
This knowledge then has to be converted into an intelligent and, if possible, entertaining
presentation of the information sought by the customer, without overtaxing or
undertaxing him (Perner & Fiss, 2002).
In an e-commerce site, data are available as merchandising data, marketing data,
server data and web meta-data. When a customer visits a web site, he leaves a trace of data
which can be used to understand the customer's needs, desires and demands as well as to
improve one's own web presence and e-shop. In order to better understand the visitors and customers
of one's own e-shop, the collected data must be analysed by applying data mining
techniques and algorithms in order to identify optimization potential and improve
marketing and sales processes, the content of the web site and e-shop, and the IT infrastructure (Hassler, 2012).
Data mining technologies can be applied in the context of e-commerce, in order to support
these optimization processes.
Definition: The most commonly accepted definition of “data mining” is the discovery of
“models” for data. (Rajaraman, Leskovec, & Ullman, 2013)
Rajaraman et al. (2013) mention different perspectives on data mining. Statisticians view
data mining as the construction of a statistical model, that is, an underlying distribution from
which the visible data are drawn. Machine-learning practitioners use the data as a training set
to train one of the many types of algorithm used in their field, such as
Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.
More recently, computer scientists have looked at data mining as an algorithmic problem. In
this case, the model of the data is simply the answer to a complex query about it. Most other
approaches to modelling can be described as either:
1. Summarizing the data succinctly and approximately, or
2. Extracting the most prominent features of the data and ignoring the rest.
When examining data mining for e-sales the following issues become relevant:
1. Data gathering – collecting valuable information for further analysis
a. Conversion information, i.e. information about where the visitor came from
and why (e.g. based on keywords used in search engines)
b. User behaviour information (e.g. usage statistics from web analysis tools)
c. Competitor information (e.g. pricing information from price search engines)
2. Data extraction and analysis – finding relevant data and correlations within the
gathered data
3. Automated reaction to data analysis, e.g. automatic changes of own prices based on
competitor information
4. Information presentation/visualisation
The details of each of those aspects are explained later in this report.
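The automated reaction named in point 3 above can be illustrated with a very small repricing rule. This is a sketch under stated assumptions, not the project's method: the rule (undercut the cheapest competitor by a fixed amount, but never go below a floor price) and the names `PricingRule` and `reprice` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PricingRule:
    floor: float            # lowest price the shop is willing to accept (assumed input)
    undercut: float = 0.01  # amount by which to undercut the cheapest competitor


def reprice(own_price: float, competitor_prices: Sequence[float], rule: PricingRule) -> float:
    """Return a new own price derived from competitor prices and the given rule."""
    if not competitor_prices:
        # Without competitor information there is nothing to react to.
        return own_price
    target = min(competitor_prices) - rule.undercut
    # Never drop below the configured floor price.
    return round(max(target, rule.floor), 2)


rule = PricingRule(floor=15.00)
print(reprice(19.99, [18.50, 21.00], rule))  # reacts to the cheapest competitor
print(reprice(19.99, [10.00], rule))         # limited by the floor price
```

Note that this simple rule also raises the own price when all competitors are more expensive; whether that is desirable is a business decision that a real service would have to make configurable.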
In order to monitor the (potential) buyers, i.e. visitors and customers of one's own e-shop,
several web analytics tools have been developed. Web analytics tools gather web usage data and
analyse and visualize them. Thus, web analytics can be considered a part of data mining
which adopts very similar technologies. Many different definitions of the term web analytics
exist. One well-known definition has been given by the Web Analytics Association
as follows.
Definition: The Web Analytics Association (2008) has defined web analytics as the
measurement, collection, analysis and reporting of internet data for purposes of
understanding and optimizing web usage.
The two objectives of web analytics are on the one hand the monitoring of the visibility of the
web site and campaigns, on the other hand the identification of potential for optimizing the
web presence.
Two main wide-spread techniques exist to conduct web analytics (Bauer et al., 2011):
1. Web server logfile analysis: web servers record some of their transactions in a logfile
which can be read and analysed with respect to certain attributes of e-shop visitors.
2. Page tagging: Concerns about the accuracy of logfile analysis when browsers apply
caching techniques, and the requirement to integrate web analytics as a cloud
service, led to the emergence of the second data collection method, page tagging or 'web bugs'.
In the past, web counters, i.e. images included in a web page that showed the number
of times the image had been requested as an estimate of the number of visits to that page, were
commonly used. Later on, a small invisible image was used together with JavaScript to pass
along certain information about the page and the visitor with the image request. This
information can then be processed and visualized by a web analytics service.
The web analytics service also needs to process a visitor’s cookies, which allow a
unique identification during his visit and in subsequent visits. However, cookie
acceptance rates vary significantly between web sites and may affect the quality of
data collected and reported.
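The first technique, web server logfile analysis, can be sketched in a few lines. The example below assumes that the server writes entries in the widely used Common Log Format; the sample lines, paths and IP addresses are invented for illustration, and counting distinct IP addresses is only a rough stand-in for counting visitors.

```python
import re
from collections import Counter

# One entry in Common Log Format:
# host ident authuser [timestamp] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None


def page_views(lines):
    """Count successful GET requests per path."""
    views = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["method"] == "GET" and entry["status"] == "200":
            views[entry["path"]] += 1
    return views


sample_log = [
    '203.0.113.7 - - [26/Feb/2014:10:15:32 +0100] "GET /shop/shoes HTTP/1.1" 200 5120',
    '203.0.113.7 - - [26/Feb/2014:10:16:02 +0100] "GET /shop/cart HTTP/1.1" 200 2048',
    '198.51.100.4 - - [26/Feb/2014:10:17:45 +0100] "GET /shop/shoes HTTP/1.1" 200 5120',
    '198.51.100.4 - - [26/Feb/2014:10:18:01 +0100] "GET /missing HTTP/1.1" 404 312',
]

views = page_views(sample_log)
entries = [e for e in map(parse_line, sample_log) if e is not None]
distinct_ips = {e["ip"] for e in entries}

print(views.most_common())
print(len(distinct_ips), "distinct IP addresses")
```

As the text notes, requests served from a browser cache never reach the server and therefore never appear in such a logfile, which is one reason why page tagging emerged.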
Details of each method are given in section 6.1 Web analytics techniques (for visitors
behaviour analysis). Other methods and techniques, such as conversion paths (funnels), click
path analyses, clickmaps, heatmaps, motion players, attention maps, visibility maps and visitor
feedback, are additionally applied for specific purposes (Bauer et al., 2011).
The metrics which can be measured with web analytics techniques have been constantly
developed further. For example, the Web Analytics Association published a paper on the
metrics which play an important role in web analytics from its point of view (Web Analytics
Association, 2008), and “ibi research” created a list of crucial metrics (Bauer et al., 2011).
Four main categories of metrics are very similar in both sources:
• Metrics for visit characterization: The terms in this section describe the behavior of a
visitor during a web site visit. Analyzing these components of visit activity can identify
ways to improve a visitor's interaction with the site.
• Metrics for visitor characterization: The terms in this section describe various
attributes that distinguish web site visitors. These attributes enable segmentation of
the visitor population to improve the accuracy and usefulness of analysis.
• Metrics for engagement: The terms in this section describe the behavior of visitors
while on a web site. However, they differ from the “visitor characterization” terms in
that they are often used to infer a visitor’s level of interaction, or engagement, with
the site.
• Metrics for conversion: Conversion terms record special activities on a site, such as
purchases, that have particular business value for the analyst. They often represent
the bottom-line “success” for a visit.
The detailed metrics of both sources are listed in section 6.2 Metrics for customer behaviour analysis.
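To make the metric categories more concrete, the following sketch computes a few commonly used figures from a list of visits. The selection of metrics and their definitions (for example, a bounce counted as a visit with exactly one page view) are assumptions for illustration and are not the definitions of either source; the `Visit` record and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Visit:
    visitor_id: str    # e.g. taken from a cookie
    pages: List[str]   # pages viewed during the visit, in order
    purchased: bool    # True if the visit ended in an order


def summarize(visits: List[Visit]) -> dict:
    """Compute simple visitor, engagement and conversion figures for a set of visits."""
    if not visits:
        raise ValueError("at least one visit is required")
    n = len(visits)
    return {
        "visits": n,
        "unique_visitors": len({v.visitor_id for v in visits}),
        "pages_per_visit": sum(len(v.pages) for v in visits) / n,
        "bounce_rate": sum(1 for v in visits if len(v.pages) == 1) / n,
        "conversion_rate": sum(1 for v in visits if v.purchased) / n,
    }


visits = [
    Visit("a", ["/home", "/product", "/checkout"], purchased=True),
    Visit("b", ["/home"], purchased=False),
    Visit("a", ["/product"], purchased=False),
    Visit("c", ["/home", "/product"], purchased=False),
]
summary = summarize(visits)
for name, value in summary.items():
    print(f"{name}: {value}")
```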
E-sales and e-marketing
In order to successfully conduct e-marketing activities on the basis of the collected data, it is
necessary to have some know-how of traditional marketing, computer sciences, and also of
analytic techniques and methods.
E-marketing is the concentration of all efforts to adapt and develop
marketing strategies for the web environment. E-marketing involves all stages of work
on a web site, such as the conception, the project itself, the adaptation of the content,
the development, the maintenance, the analytical measuring and the advertising (Strauss,
Frost, & Ansary, 2009).
The need to develop specific marketing strategies for the internet implies that some traditional
principles are adapted, or even reinvented. Keeping the customer’s attention on the web
presence requires building up a strong customer relationship and offering services which
attract the customer to visit the web site frequently and purchase products and services.
Four activities facilitate the deployment of e-marketing strategies (Stolpmann, 2001):
• Online promotion, which aims to bring an advertising
message targeted at a specific customer group to this group quickly and
cost-effectively;
• Online shopping, i.e. the selling of products or services via the internet, which requires at least
a product catalogue and a safe and error-tolerant transaction line for ordering and
paying for the products and services;
• Online service, i.e. a service provided via the internet, which can be free or chargeable and
can be accessed from everywhere in the world at any time;
• Online collaboration, where users are enabled to get into contact with other users or
the seller and can express their opinion of the products or services.
The following work focuses on data mining services which support the e-sales and e-marketing
activities of e-shops and the offering of appropriate services which are valued by the
customers. Appropriate services can be provided by gathering relevant information about visitors,
customers and their behaviour and by analysing this information in order to identify
optimization potential or initiate actions to support e-sales and e-marketing. Section 4
Analysis of data mining for e-sales elaborates further on data mining and on the aspects relevant
to its application in e-sales activities.
2.3 Semantic Web
The web is the biggest information system ever known, and it is always growing and changing.
However, most information on the web is designed for human consumption. Leaving aside the
artificial intelligence problem of training machines to behave like people, the Semantic web
approach instead develops languages for expressing information in a machine processable
form (Berners-Lee et al, 2001). The semantic web is a web, the content of which can be
processed by computers. It can be thought of as an infrastructure for supplying the web with
formalized knowledge in addition to its actual informal content.
The Semantic web perspective has been defined as a layered tower (Figure 10). This tower
keeps the URI (Uniform Resource Identifier) as the basis of the Semantic web. On top of this
layer, there are two choices for data representation and interchange: RDF (Resource
Description Framework) and XML. This is interesting because the (new) use of RDF as an
interchange format (and not only for metadata) opens new perspectives for the
implementation of applications, and makes it possible to use Semantic web query languages to
access this data. In this sense, SPARQL is proposed as the language for querying RDF data. The
introduction of rules through the “Rule: RIF” layer enables the definition of complex rules that
allow applications to apply more sophisticated mechanisms for inferring new knowledge.
Another novelty is the explicit declaration of the need to produce applications and user
interfaces to make the Semantic web a real product.
Figure 10. The Semantic web Tower
2.3.1 Linked Data
The development of the Semantic web has been focused on the cooperation between
computers rather than the cooperation between computers and people. This gap
between the objectives and the results led to the idea of Linked Data (Heath and Bizer, 2011),
which aims to provide a practical solution to the semantic annotation of web data. Linked Data
represents a way for the Semantic web to link the data that are distributed on the web, so that
they can be referenced and navigated in a similar way to surfing the web via HTML pages. Thus, the goal of the
Semantic web goes beyond the simple publication of data on the web: it links data with
other data, allowing people and machines to explore the web of data and to access related
information referenced from other, initial data.
In the web of hypertext (or web of documents), links are relationships between points in
documents written in HTML. In the web of data, links are relationships
between anything that is described in RDF, transforming the web into a (kind of) global
database. This new conception of the web has been defined by Tim Berners-Lee as the "Giant
Global Graph", which uses Semantic web techniques to produce linked data by encouraging the
publication of large amounts of semantically annotated data.
Linked Data, supported by an active community, encourages the application of standards that
can be summarized as follows: (i) the use of HTTP URIs, (ii) the SPARQL query language and (iii)
the Resource Description Framework (RDF) and the Web Ontology Language (OWL) for data modeling
and representation.
2.3.2 Ontologies
Ontologies provide a formal representation of the real world, shared by a sufficient number of
users, by defining concepts and relationships between them. In the context of computer and
information sciences, an ontology defines a set of representational primitives with which to
model a domain of knowledge or discourse.
The representational primitives are typically concepts (or classes), attributes (or properties),
class members (class instances) and relationships (property instances). The definitions of the
representational primitives include information about their meaning and constraints on their
logically consistent application.
The term “ontology” comes from the field of philosophy that is concerned with the study of
being or existence. In computer and information science, ontology is a technical term denoting
an artifact that is designed for a purpose, which is to enable the modeling of knowledge about
some domain, real or imagined.
Ontologies are part of the W3C standards stack for the Semantic web, in which they are used
to specify standard conceptual vocabularies in which to exchange data among systems,
provide services for answering queries, publish reusable knowledge bases, and offer services
to facilitate interoperability across multiple, heterogeneous systems and databases.
In order to provide semantics to web resources, instances of ontology classes and properties
(expressed as RDF triples, see footnote 16) are used to annotate them. These annotations over the resources,
which are based on ontologies, are the basis of the Semantic web. The reasoning
capabilities of ontologies allow Semantic web applications to infer implicit knowledge from what is explicitly
asserted, enabling a new, more advanced kind of application. Querying and reasoning on
instances of ontologies will make the Semantic web useful.
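The idea of inferring implicit knowledge from explicit assertions can be illustrated with a deliberately tiny example. The sketch below applies a single RDFS-style rule (an instance of a class is also an instance of every superclass) to a set of triples held as plain Python tuples. It is not a full reasoner, and the `ex:` names are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Return the input triples plus all rdf:type statements implied by rdfs:subClassOf."""
    known = set(triples)
    changed = True
    while changed:  # repeat until no new statement can be derived
        changed = False
        subclass_pairs = [(s, o) for s, p, o in known if p == SUBCLASS_OF]
        typed = [(s, o) for s, p, o in known if p == RDF_TYPE]
        for instance, cls in typed:
            for sub, sup in subclass_pairs:
                candidate = (instance, RDF_TYPE, sup)
                if sub == cls and candidate not in known:
                    known.add(candidate)
                    changed = True
    return known


asserted = {
    ("ex:Sneaker", SUBCLASS_OF, "ex:Shoe"),
    ("ex:Shoe", SUBCLASS_OF, "ex:Product"),
    ("ex:item42", RDF_TYPE, "ex:Sneaker"),
}
inferred = infer_types(asserted) - asserted
for triple in sorted(inferred):
    print(triple)
```

Only the statement that `ex:item42` is a sneaker was asserted; that it is also a shoe and a product follows from the class hierarchy.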
Ontologies are often very complicated, and are difficult to write, maintain and compare. The
problem of building an ontology for an organization, for example, is the same as the problem of building a model of the
important elements of that organization. There will be different ways of looking at the
organization, and there will be different priorities for different people. Then, as you get more
information, your view of the organization may change, or the organization might be
restructured, requiring you to rewrite the ontology. The problem is rather like
deciding on the structure of a relational database and then perhaps having to reorganize it
after you have added lots of data.
2.3.3 Web ontology languages
Ontologies play a crucial role in the development of the web. This has led to the extension of
mark-up languages in order to develop ontologies. Examples of these languages are RDF and
RDFS. RDF is a graph-based language used for representing information about resources on the
web. It is a basic ontology language. Resources are described in terms of properties and
property values using RDF statements. Statements are represented as triples, consisting of a
subject, predicate and object. RDF Schema “semantically extends” RDF to enable us to talk
about classes of resources, and the properties that will be used with them. It does this by
giving particular meanings to certain RDF properties and resources. RDF Schema provides the
means to describe application-specific RDF vocabularies.
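The triple structure described above can be sketched without any RDF library. The example below stores statements as (subject, predicate, object) records and answers simple pattern queries in which unspecified positions act as wildcards, loosely in the spirit of a single SPARQL triple pattern. The `ex:` identifiers and the `match` helper are hypothetical and chosen for illustration only.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # another resource or a literal value


def match(graph: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return all triples that agree with the given positions; None acts as a wildcard."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


graph = [
    Triple("ex:shop1", "ex:sells", "ex:item42"),
    Triple("ex:item42", "rdf:type", "ex:Sneaker"),
    Triple("ex:item42", "ex:price", "59.90"),
    Triple("ex:shop1", "ex:sells", "ex:item43"),
]

sold_by_shop1 = [t.obj for t in match(graph, subject="ex:shop1", predicate="ex:sells")]
about_item42 = match(graph, subject="ex:item42")
print(sold_by_shop1)
print(about_item42)
```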
RDF and RDF Schema provide basic capabilities for describing vocabularies that describe
resources. RDFS is commonly used for describing metadata and simple ontologies. Its
expressivity is not sufficient for several applications, but it provides a good
foundation for interchanging data and enabling true Semantic web languages to be layered on
top of it. Certain other capabilities are desirable, e.g. cardinality constraints,
specifying that properties are transitive, specifying inverse properties, specifying the “local”
range and/or cardinality of a property when used with a given class, the ability to describe new
classes by combining existing classes (using intersections and unions), and negation (using “not”).
Besides, ontology languages must fulfill some other requirements, such as a well-defined syntax, convenience of expression, formal semantics (which are needed for reasoning),
efficient reasoning support and sufficient expressive power.
16 A triple is an RDF statement which contains a subject, a predicate and an object about a resource
where the subject is the resource itself, the predicate is the relationship between the resource and the
object, and the object can be another resource or a data value.
OWL is the latest standard in ontology languages from the World Wide Web Consortium
(W3C). It is built on top of RDF (OWL semantically extends RDF(S)) and based on its predecessor
language DAML+OIL. OWL has a rich set of modeling constructors. In 2004, the W3C ontology
working group proposed OWL as a semantic markup language for publishing and sharing
ontologies on the World Wide Web. From a formal point of view, OWL is equivalent to a very
expressive description logic (DL), where an ontology corresponds to a TBox. This equivalence allows
the language to exploit the results of description logic research. Its
primary aim is to bring the expressive and reasoning power of description logic to the Semantic
web. Unfortunately, not everything from RDF can be expressed in DL. OWL provides two sublanguages: OWL Lite for simple applications and OWL DL, which represents the subset of the
language that is equivalent to a description logic and whose reasoning mechanisms are quite complex. The
complete language is called OWL Full.
3 Analysis of online anti-fraud systems
3.1 Current Trends and Practices
3.1.1 Introduction
Before we proceed with the exposition of current trends and practices in online fraud
management, it is essential to make a methodological distinction between fraud
detection and fraud prevention (see footnote 17). Fraud detection is the task of unmasking ongoing malicious
activity. For this purpose, and depending on the online retailer’s expertise, several solutions
exist, ranging from manual order screening to advanced pattern recognition algorithms for
spotting anomalous user behavior. Fraud prevention (FP) refers to early precautions (or safety
measures) that an organization has to take in order to discourage fraudsters from taking
further action. Manual card inspection, payment authentication codes and internet protocols
for secure information exchange can be thought of as pre-fraud practices, because they
aim to discourage fraudsters from taking action in the first place. In
fact, FP is a composite task that goes well beyond a simple set of security protocols; it is a
blend of organizational practices, legislative framework, technology and government
policies. But no matter how well “armored” a system is, there will always be cases where it
fails to detect an intrusion, especially if we take into account how quickly cybercriminals
manage to adapt to defensive mechanisms (Bolton and Hand, 2002). Therefore, it always pays
off to invest in new technologies that can detect malicious activities early, before their
consequences become evident to the online merchant. The technological content of fraud
detection systems is the focus of Work Package (WP) 3 of the SME E-COMPASS project.
3.1.2 Manual order review
The obvious way to deal with fraudulent transactions is through manual review, a practice
that most small & medium (SM) e-shops still follow today. A fraud specialist examines
the incoming order, contacts the customer to verify his/her shipping address, asks for
supplementary information/documents (e.g. a photocopy of the identity or credit card),
conducts background research in social networks to sketch the profile of the customer, and
finally decides whether to execute or reject the order.
Notwithstanding the fact that manual order review contradicts the very idea of automating the
sales channel, it is also considered as time inefficient and cost prohibitive. According to
CyberSource fraud reports18 in North America 73% of e-merchants performs manual order
review, while 52% of fraud management budget is spent on order review staff costs. For most
17 See also Bolton and Hand (2002), Behdad et al. (2012) for a discussion.
18 See footnotes 13 & 14 for the respective reports.
companies in the US and Canada, budgets and resources for fraud detection remained unchanged in
2013. In the UK, 58% of respondents manually review transactions, down from 61% in 2012, while
7% analyse every order. Regarding workload, an average of 77 orders are reviewed
manually per reviewer daily. In general, of those merchants that do perform review, larger
companies analyse a much lower proportion of orders. This is expected given the scalability and cost
challenges associated with review.
Modern online shops are meant to operate on a 24/7 basis, receiving hundreds/thousands of
orders per day, each described by tens/hundreds of attributes (contact details, shipping
address, customer details, IP address, etc.). Under these working conditions, it becomes extremely
difficult for human experts to adequately process all available information and respond within
a reasonable time frame. Instead, the increasing involvement of fraud specialists is likely to
lead to congestion in the order processing system, unnecessary delays and increasing
customer dissatisfaction. False positive cases contribute significantly to the total cost of fraud.
For instance, physical goods retailers in the UK14 reject an average of 6% of orders for fear of
fraud, while among all English retailers 4.3% of manually reviewed orders are rejected due to
suspicion of fraud.
For many reasons, manual screening can also result in non-rational sales management. In an
attempt to reduce the chances that a malicious order passes unnoticed - especially after
having recorded a sequence of failures in detecting similar behaviors in the past - fraud
analysts often move to the other extreme. They become too strict against a large group of
(supposedly) risky users, whose profile matches similar suspicious cases, or they ask for full
assurance before they give their approval to process the order. These “desperate” practices can
have undesirable consequences for online business: unnecessary delays in order processing, a
feeling of dissatisfaction or “punishment” among reliable customers and revenue leakage due
to the rejection of orders that look suspicious but in fact are not (see also Leonard, 1995).
As the online business scales up, it becomes important for SM e-merchants to modernize the
transaction-validation process through the use of automatic monitoring tools. These could
act as a supplement to manual order review and help reduce its deficiencies.
3.1.3 Data used in fraud detection
One of the biggest problems associated with fraud detection is the lack both of literature
providing experimental results and of real-world data accessible to academic researchers for
conducting experiments. This happens because fraud detection is often associated with
sensitive financial data that are kept confidential for reasons of customer privacy.
Most of the techniques used for detecting credit card fraud aim at detecting
transactions that deviate from the norm. Deviation from the usual patterns of an entity could
imply the existence of fraud.
The main data sources used for online fraud detection are databases and data warehouses
with credit card transaction data, personnel databases and accounting databases. They usually
belong to the banks or the credit card providers. Furthermore, in order to train the algorithms,
databases containing fraudulent transactions and legitimate transactions are needed.
Data representing the card usage profiles of the customers are also used. Every card profile
consists of variables, each of which discloses a behavioral characteristic of the card usage.
These variables may show the spending habits of the customer with respect to geographical
locations, days of the month, hours of the day or Merchant Category Codes (MCC), which
show the type of merchant with which the transaction was placed.
Hand and Blunt (2011) describe the transaction records they used in their experiment. These
data were obtained from a Visa credit card database. Each transaction record includes the
following fields:
• Date that the transaction was recorded in the account. Note that this usually excludes
weekends and public holidays, and is around a day or two after the transaction was
actually made.
• Amount of the transaction
• Merchant Category Code (MCC) of the outlet where the transaction was made
• Transaction type. This is an indicator of actions such as: sales transaction, credit refunds,
cash handling charges and type of cash transaction (manual or at a cash machine).
Data in credit card fraud have the following characteristics (Hand, 2009):
- Billions of transactions
- Mixed variable types (in general not text data or image)
- Large number of variables
- Incomprehensible variables, irrelevant variables
- Different misclassification costs
- Many ways of committing fraud
- Unbalanced class sizes (c. 0.1% of transactions fraudulent)
- Delay in labeling
- Mislabeled classes
- Random transaction arrival times
- (Reactive) population drift
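Two of the characteristics listed above, unbalanced class sizes and different misclassification costs, can be illustrated with a small worked example. The sketch below assumes 100,000 transactions of which 0.1% are fraudulent, and hypothetical costs of 200 per missed fraud and 5 per false alarm; none of these figures come from the cited sources. It shows that a system approving every order looks almost perfect in terms of accuracy while still incurring the full cost of fraud.

```python
def confusion_counts(labels, predictions):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks a fraudulent case."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(labels, predictions):
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(labels, predictions):
    """Share of transactions that were classified correctly."""
    tp, fp, fn, tn = confusion_counts(labels, predictions)
    return (tp + tn) / len(labels)


def total_cost(labels, predictions, cost_missed_fraud, cost_false_alarm):
    """Cost-weighted error: missed frauds and false alarms are priced differently."""
    tp, fp, fn, tn = confusion_counts(labels, predictions)
    return fn * cost_missed_fraud + fp * cost_false_alarm


# Assumed data: 100 fraudulent cases among 100,000 transactions (0.1%).
labels = [1] * 100 + [0] * 99_900
approve_all = [0] * len(labels)  # a "detector" that never raises an alarm

naive_accuracy = accuracy(labels, approve_all)
naive_cost = total_cost(labels, approve_all,
                        cost_missed_fraud=200.0,  # hypothetical cost per missed fraud
                        cost_false_alarm=5.0)     # hypothetical cost per false alarm
print(naive_accuracy, naive_cost)
```

For this reason accuracy alone is a poor yardstick in this setting, and cost-sensitive or rate-based measures are usually preferred.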
Credit card data used to be defined by means of 70-80 variables per transaction. Among the most
used are: transaction ID, transaction type, date and time of transaction (to the nearest second), amount, currency, local
currency amount, merchant category, card issuer ID, ATM ID, POS, cheque account prefix,
savings account prefix, acquiring institution ID, transaction authorization code, online
authorization performed, new card, transaction exceeds floor limit, number of times the chip has
been accessed, merchant city name, chip terminal capability and chip card verification results.
US Patent 5,819,226 on fraud detection and modelling (HNC Software, 1992) lists the
following variables:
- Customer usage pattern profiles representing time-of-day and day-of-week profiles
- Expiration date of the credit card
- Dollar amount spent in each SIC (Standard Industrial Classification) merchant group
category during the current day
- Percentage of dollars spent by a customer in each SIC merchant group category during
the current day
- Number of transactions in each SIC merchant group category during the current day
- Percentage of number of transactions in each SIC merchant group category during the
current day
- Categorization of SIC merchant group categories by fraud rate (high, medium, or low
risk)
- Categorization of SIC merchant group categories by customer types (groups of
customers that most frequently use certain SIC categories)
- Categorization of geographic regions by fraud rate (high, medium, or low risk)
- Categorization of geographic regions by customer types
- Mean number of days between transactions
- Variance of number of days between transactions
- Mean time between transactions in one day
- Variance of time between transactions in one day
- Number of multiple transaction declines at same merchant
- Number of out-of-state transactions
- Mean number of transaction declines
- Year-to-date high balance
- Transaction amount
- Transaction date and time
- Transaction type
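Several of the variables listed above are simple aggregates over an account's transaction history. The following minimal sketch shows how two of them could be derived: the mean and variance of the number of days between transactions, and the amount and percentage spent per merchant category during a given day. The record layout (timestamp, amount, category), the category labels and the sample values are assumptions made for illustration and are not taken from the patent.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean, pvariance


def days_between_stats(timestamps):
    """Mean and population variance of the gaps (in days) between consecutive transactions."""
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds() / 86400.0
            for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return None, None
    return mean(gaps), pvariance(gaps)


def daily_category_profile(transactions, day):
    """Amount spent and share of spending per merchant category on the given day."""
    totals = defaultdict(float)
    for timestamp, amount, category in transactions:
        if timestamp.date() == day:
            totals[category] += amount
    grand_total = sum(totals.values())
    shares = {category: (amount / grand_total if grand_total else 0.0)
              for category, amount in totals.items()}
    return dict(totals), shares


# Hypothetical history of one card: (timestamp, amount, merchant category).
transactions = [
    (datetime(2014, 3, 1, 10, 0), 40.0, "grocery"),
    (datetime(2014, 3, 3, 10, 0), 60.0, "restaurant"),
    (datetime(2014, 3, 7, 10, 0), 30.0, "grocery"),
    (datetime(2014, 3, 7, 18, 0), 90.0, "electronics"),
]

mean_gap, var_gap = days_between_stats([t for t, _, _ in transactions])
totals, shares = daily_category_profile(transactions, date(2014, 3, 7))
print(mean_gap, var_gap, totals, shares)
```

In a deployed system such aggregates would normally be updated incrementally as transactions arrive rather than recomputed from the full history.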
To circumvent the data availability problems, one alternative is to create synthetic data that
closely match actual data. Barse et al. (2003) argue that synthetic data can be used to train and adapt
a system without any data on known frauds, to artificially create variations of known frauds and new
frauds, and to benchmark different systems.
3.2 State-of-the-art technologies
3.2.1 Introduction
Fraud detection systems (FDS) are nowadays quite popular in e-commerce; according to a
recent market survey they are used by more than half of the US and Canadian merchants doing
business online19. A typical FDS receives information on the transaction parameters or the
customer profile and comes up with an indication as to the riskiness of the particular order
(riskiness/suspiciousness score). Based on this initial risk assessment, the order can follow
one of three routes: instant execution, automatic rejection or suspension for manual
19 See “2011 Online Fraud Report” (http://www.cybersource.com/current_resources).
review. Modern FDS are typically categorized in three groups: expert systems, supervised
learning techniques and anomaly detection methods. These are of varying degree of
sophistication and also differ as to the mechanisms used to acquire and represent knowledge.
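The routing step of a typical FDS, as described above, can be sketched in a few lines. The function name and the two cut-off values are hypothetical; in practice each merchant would calibrate them according to its own tolerance for fraud losses and for manual review workload.

```python
def route_order(risk_score, accept_below=0.20, reject_above=0.90):
    """Map a risk score in [0, 1] to one of the three routes an order can follow.

    The thresholds are illustrative assumptions, not values from the report.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must lie in [0, 1]")
    if risk_score < accept_below:
        return "execute"        # low risk: instant execution
    if risk_score > reject_above:
        return "reject"         # high risk: automatic rejection
    return "manual_review"      # grey zone: suspend for manual review


for score in (0.05, 0.50, 0.95):
    print(score, route_order(score))
```

Widening the band between the two thresholds sends more orders to manual review, which trades automation for caution.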
3.2.2 Expert systems
Expert systems (ES) are the most popular cases of computer-based FDSs. They contain a pool
of fraud detection rules and facts, which are interactively derived from domain experts. This
rule engine can be subsequently used to screen incoming orders and classify them as normal,
anomalous or partly suspicious. In other variations of ES design, rules do not explicitly provide
the classification result but assign to each order a suspiciousness score that can be interpreted
as the probability that the order is fraudulent or as a degree of similarity to other examples of
malicious activity. Expert rules typically take the form of a hypothetical (“IF-THEN”)
proposition20. The “IF” part combines several transaction attributes and the “THEN” part
outputs a classification or riskiness index. A hypothetical example of this sort of conditional
statement is given below:
IF credit_card ∈ {Black_List} AND email_type = {non_institutional}
THEN risk_score = 96%
This has the following interpretation: if the credit card used for payment is “black-listed” (for
example, because the same card has also been used in previous malicious transactions) and
the customer’s email address does not belong to a particular institution (i.e. it is “anonymous”)
then the probability of the order being fraudulent is 96%.
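The hypothetical rule above translates almost directly into executable form. In the sketch below the black list, the set of institutional e-mail domains and the order fields are invented for illustration; only the structure of the rule and the 96% score follow the example in the text.

```python
# Illustrative data: both sets are assumptions, not part of any real rule base.
BLACK_LIST = {"4111-XXXX-XXXX-0001"}
INSTITUTIONAL_DOMAINS = {"uni.example.edu", "corp.example.com"}


def email_type(email):
    """Classify an address by its domain (hypothetical helper)."""
    domain = email.rsplit("@", 1)[-1].lower()
    return "institutional" if domain in INSTITUTIONAL_DOMAINS else "non_institutional"


def blacklist_rule(order):
    """IF credit_card in Black_List AND email_type = non_institutional THEN risk_score = 96%.

    Returns the risk score when the rule fires and None when it does not.
    """
    if (order["credit_card"] in BLACK_LIST
            and email_type(order["email"]) == "non_institutional"):
        return 0.96
    return None


suspicious = {"credit_card": "4111-XXXX-XXXX-0001", "email": "someone@freemail.example"}
ordinary = {"credit_card": "4111-XXXX-XXXX-0001", "email": "staff@uni.example.edu"}
print(blacklist_rule(suspicious), blacklist_rule(ordinary))
```

Returning None when the condition is not met keeps "the rule did not fire" distinct from "the rule assigned a low score".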
Leonard (1995) presents an example of an expert system prepared for a Canadian bank with
the purpose of flagging suspicious activity in credit card accounts. Fraud detection formulae
are collected from “in-house” experts using a variant of the Delphi method for eliciting information
in a structured way. Stefano and Gisella (2001) present a methodology for building tree-like
structures of suspiciousness rules that are able to handle different cases of fraudulent
insurance claims received by an Italian company. These rules are constructed under the
principles of fuzzy logic, which provides a systematic framework for encoding qualitative
information and designing “smooth” classifiers21. This results in a behavior for the expert
system that closely resembles how human analysts handle fraudulent cases in practice.
Another example of a fuzzy rule-based authentication system for the insurance industry is
discussed in Pathak et al. (2005)22.
The obvious advantage of expert systems is the presented opportunity to encode the collective
experience of fraud professionals in a compact and manageable knowledge base. Nowadays,
the development of such a “knowledge platform” is made easy by the existence of numerous
20 See e.g. Hayes-Roth et al. (1983) and Silverman (1987).
21 Fuzzy classifiers typically output a degree of confidence by which objects can be classified in each
available category.
22 See also Phua et al. (2005) for a review of other studies utilising expert decision systems.
commercially-available tools. Despite their structural simplicity and user-friendliness, expert
systems suffer from a number of disadvantages23:
• Subjectivity. The performance of an ES is solely determined by the quality of the
embedded knowledge. This means that it takes many rounds of expert interviews
and a lot of maintenance effort to bring the system to a level at which it can effectively
combat a wide range of fraud types. But even if the system is equipped with all
fraud “fingerprints”, there is no way to guarantee that it will not carry the biases and
subjectivity that flaw expert judgments.
• Limited controllability. The way knowledge is stored in a rule engine makes it difficult
for the system manager to have sufficient control over the overall risk scoring
process. In a large knowledge base, there may be overlapping rules (i.e. rules fired
simultaneously by the same transaction) with conflicting verdicts. For instance, one
rule may suggest rejection while another may point to manual review. How best to
resolve these cases is not obvious.
• Lack of adaptivity. Because the maintainer of an expert system relies on fraud analysts to
provide scoring rules, the system cannot quickly adapt to changes in intrusion tactics
or to the emergence of new types of cybercrimes. This sort of “knowledge aging”
has severe implications for the performance of the system in the long run.
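The controllability issue can be made concrete with a small sketch of overlapping rules. The example below resolves conflicts by letting the most severe verdict win, which is only one possible policy and is an assumption of this illustration; priorities per rule, score averaging or escalation to manual review are equally defensible alternatives. The rules and order fields are hypothetical.

```python
# Assumed ordering of verdicts from least to most severe.
SEVERITY = {"accept": 0, "manual_review": 1, "reject": 2}


def fired_verdicts(order, rules):
    """Collect the verdicts of all rules whose condition holds for this order."""
    return [verdict for condition, verdict in rules if condition(order)]


def resolve(verdicts, default="accept"):
    """Conflict-resolution policy used here: the most severe verdict wins."""
    return max(verdicts, key=SEVERITY.__getitem__, default=default)


# Two hypothetical rules that can fire on the same transaction.
rules = [
    (lambda order: order["amount"] > 1000, "manual_review"),
    (lambda order: order["card_blacklisted"], "reject"),
]

order = {"amount": 1500, "card_blacklisted": True}
verdicts = fired_verdicts(order, rules)
print(verdicts, resolve(verdicts))
```

Making the policy an explicit function at least gives the system manager a single place where such conflicts are decided.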
3.2.3 Supervised learning techniques
The key factor fuelling most of the problems associated with an expert system is the need for
human intervention. Therefore, it makes sense to investigate the prospect of mechanically
obtaining the knowledge that is necessary to combat fraud. This can be done by analyzing
historical transaction data that have been stored in an e-shop’s database, an idea
embodied in supervised learning techniques. The development of a self-learning FDS
typically follows two stages: training and validation24. In the training phase, the system is
presented with positive and negative examples of the concept to be learned, i.e. a particular or
multiple types of fraud. These examples are labelled or tagged, in the sense that they have
been pre-classified by experts in known normal or fraudulent categories (hence the term
“supervised learning”). The system analyzes the data and extracts general rules/models that
associate certain transaction characteristics with the pre-specified risk categories. Before the
system is put into action, it typically undergoes a validation process during which its
performance is tested on previously unseen records of legitimate/fraudulent transactions. This
validation stage allows analysts to have a more unbiased view on how the system is likely to
perform beyond the training dataset.
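The two-stage procedure described above can be sketched with a deliberately simple learner. The example assumes labelled records of the form (amount, label), with label 1 marking fraud, and "learns" a single amount threshold from the training part before measuring its error on the held-out part. The data, the 30% validation share and the threshold learner are illustrative assumptions; any of the models discussed in this section could take the learner's place.

```python
import random


def split(records, validation_fraction=0.3, seed=0):
    """Shuffle the labelled records and hold out a fraction for validation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def error_rate(threshold, records):
    """Share of records misclassified by the rule 'fraud if amount > threshold'."""
    mistakes = sum((amount > threshold) != bool(label) for amount, label in records)
    return mistakes / len(records)


def train_threshold(training):
    """Training stage: choose the observed amount that minimises the training error."""
    candidates = sorted({amount for amount, _ in training})
    return min(candidates, key=lambda t: error_rate(t, training))


# Hypothetical pre-classified examples: small legitimate and large fraudulent amounts.
records = ([(amount, 0) for amount in range(10, 110, 5)]
           + [(amount, 1) for amount in range(200, 300, 10)])

training, validation = split(records)
threshold = train_threshold(training)
training_error = error_rate(threshold, training)
validation_error = error_rate(threshold, validation)  # estimate on unseen records
print(threshold, training_error, validation_error)
```

The validation error can exceed the training error even on such clean data, which is precisely why the second stage is needed.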
When browsing the literature for supervised-learning solutions to fraud detection, one ends
up with numerous results ranging from conventional statistical techniques to intelligent
23 See also MacVittie (2002) and Wong et al. (2011).
24 See e.g. Bolton and Hand (2002), Mitchell (1997) and Michalski et al. (1998).
machine-learning algorithms. Most of these methods are data-driven and general-purpose, in
the sense that they have not been specifically designed for fraud detection but usually
borrowed from other application domains. Table 13 in the Appendix provides a comprehensive
list of research papers presenting, among others, supervised-learning technologies for fraud
monitoring.
Statistical supervised paradigms, such as logistic regression or discriminant analysis, are
nowadays considered as mainstream and mainly treated as benchmarks for more advanced
learning algorithms. Bhattacharyya et al. (2011) and Jha et al. (2012) advocate the use of the
logistic regression framework in the monitoring of credit card fraud, because of the capability
of these models to deal effectively with multiple learning classes. Logistic regressions
effectively output a classification probability distribution (or else a class membership array)
for each problem instance based on its descriptive attributes. Lee et al. (2010) use logistic
regression models to uncover a relatively modern and interesting case of e-fraud: the
manipulation of online auctions. Applications of logistic regression in online payment
monitoring are also found in Shen et al. (2007) and Brabazon et al. (2010). Discriminant
analysis has been applied among others by Whitrow et al. (2009), Louzada and Ara (2012).
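The "class membership array" produced by a logistic regression model can be illustrated with the multinomial (softmax) form of the model. In the sketch below the three classes, the two input features and all coefficients are made up for illustration; in a real application the coefficients would be estimated from labelled transactions.

```python
import math


def class_membership(features, weights, biases):
    """Multinomial logistic model: one linear score per class, turned into probabilities."""
    scores = [bias + sum(w * x for w, x in zip(class_weights, features))
              for class_weights, bias in zip(weights, biases)]
    top = max(scores)                        # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


CLASSES = ["legitimate", "suspicious", "fraudulent"]  # assumed learning classes

# Hypothetical coefficients for two features: scaled amount and a foreign-IP flag.
weights = [[-1.0, -0.5], [0.2, 0.4], [0.8, 1.5]]
biases = [2.0, 0.0, -1.0]

probabilities = class_membership([1.2, 1.0], weights, biases)
print(dict(zip(CLASSES, probabilities)))
```

The entries of the array are non-negative and sum to one, so they can be read as class probabilities for the order at hand.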
More advanced paradigms of supervised learning for fraud detection include artificial neural
networks (Ghosh and Reily, 1994; Aleskerov et al., 1997; Hanagandi et al., 1996; Brause et al.,
1999; Shen et al. 2007; Xu et al., 2007; Gadi et al., 2008), support vector machines (Chen et al.,
2004; Whitrow et al., 2009; Bhattacharyya et al., 2011) and Bayesian classifiers (Maes et al.,
2002; Gadi et al., 2008; Whitrow et al.,2009; Louzada and Ara, 2012). Those commonly differ
from traditional statistical approaches in their ability to model complex data relationships and
nonlinear boundaries between problem classes (see also Hodge and Austin, 2004). Hence the
term machine learning that is often used to characterize this group of models (see Quinlan,
1993; Mitchell, 1997; Michalski et al., 1998).
A branch of supervised learning techniques induces more symbolic representations for the
obtained knowledge, typically in the form of associative rules or decision trees. Associative
rules link one or several input attributes with particular problem classes, while a decision tree
is a hierarchical classification structure by which cases are progressively assigned to preselected categories (tree leaves) based on the outcome of each decision node (Quinlan, 1993;
Mitchell, 1997). Decision trees are equally capable of encoding complex data relationships, just
as artificial neural networks, but they offer more user-friendly and transparent knowledge
representations. Hence, they are favored in application domains, such as fraud scoring, where
interpretability of the classification result is also an issue (see also section 3.5.9). Stolfo et al.
(1997), Prodromidis and Stolfo (1999), Prodromidis et al. (2000) use two rule-learning
techniques, namely RIPPER and CN2, as well as several tree-induction algorithms (ID3, C4.5,
CART) to create base classifiers for monitoring fraudulent activity in credit card transactions.
Other relatively recent studies employing tree inductive learning in the context of fraud
detection are Shen et al. (2007), Gadi et al. (2008), Whitrow et al. (2009), Bhattacharyya et al.
(2011) and Sahin et al. (2013).
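Tree-induction algorithms such as ID3 select the attribute tested at each decision node by its information gain, i.e. the reduction in the entropy of the class labels obtained by splitting on that attribute (C4.5 uses a normalised variant of the same quantity). The sketch below computes this criterion on four invented order records; the attribute names and values are assumptions for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Information entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(records, attribute, target="fraud"):
    """Entropy of the target minus the weighted entropy after splitting on the attribute."""
    groups = {}
    for record in records:
        groups.setdefault(record[attribute], []).append(record[target])
    remainder = sum(len(group) / len(records) * entropy(group)
                    for group in groups.values())
    return entropy([record[target] for record in records]) - remainder


# Hypothetical labelled orders.
records = [
    {"blacklisted": True, "email": "free", "fraud": True},
    {"blacklisted": True, "email": "institutional", "fraud": True},
    {"blacklisted": False, "email": "free", "fraud": False},
    {"blacklisted": False, "email": "institutional", "fraud": False},
]

gain_blacklisted = information_gain(records, "blacklisted")
gain_email = information_gain(records, "email")
print(gain_blacklisted, gain_email)
```

In this toy data the black-list attribute separates the classes perfectly and would be placed at the root of the tree, while the e-mail attribute carries no information.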
A remarkably active trend in fraud monitoring systems is the application of computer
programs equipped with intelligent mechanisms for knowledge extraction. This sort of
computational intelligence (CI) is typically built upon elements and metaphors of cognitive,
natural or social processes. Artificial neural networks are perhaps one of the earliest attempts
to create CI by imitating certain elements and functionalities of the human brain. Tree-learning
algorithms, such as ID3 and C4.5, use a variant of the physical concept of entropy, termed
information entropy, to hierarchically categorize problem variables and create a knowledge
representation structure (decision tree) that resembles the human reasoning mechanism. A
relatively recent area of research in computational intelligence is the development of
algorithms simulating behaviors or systems encountered in nature; for instance, the flocking
behavior of bird species, the foraging strategies of bees/ants, the processes of the immune
system, the biological evolution of species, etc. The increasing interest in natural computing
stems from the fact that one can learn a lot about how to handle complex problems by simply
observing what nature does in similar situations (Wong et al., 2011). Indeed, nature-inspired
(NI) techniques have some unique characteristics that help them overcome many of the
difficulties associated with traditional learning paradigms25:
1. Universality. NI algorithms are general-purpose techniques that make few (if any)
assumptions on the types of problem data (numerical, ordinal, categorical, etc.) or the
data-generating process. Hence, they can be easily adapted to the problem context at
hand with only slight modifications.
2. Scalability. The performance of an NI technique typically scales with the size of the
learning problem. In more conventional statistical/CI paradigms, such as artificial
neural networks, when the number of variables increases, one has to complicate the
model structure or increase the number of free parameters to maintain a desirable
level of performance. In the case of NI paradigms, complexity can be obtained through
the aggregation and cooperation of multiple units or agents, which otherwise perform
simple tasks (collective/swarm intelligence). This gives NI algorithms the ability to
adapt to difficult learning situations while maintaining the simplicity and transparency
of the model structure.
Popular nature-inspired methodologies that are often used in fraud detection applications are
genetic algorithms, particle swarm optimization, ant colony optimization and artificial
immune systems (AIS). Behdad et al. (2012) provide a comprehensive survey of up-to-date
research studies in this area. Indicatively, we mention the works of Bentley et al. (2000), who
employ genetic programming to evolve a set of scoring rules for credit card transactions, and
Brabazon et al. (2010), who apply the AIS methodology to identify credit card payment fraud in
an online store.
Artificial immune systems are of particular interest in fraud-related applications, as they
emulate the characteristics of an eminently sophisticated intrusion-detection system
developed by nature. The immune system of natural organisms has a unique ability to
recognize alien detrimental objects, which it might never have come across before. It is also
equipped with prioritization mechanisms for allocating defence efforts based on each
25 See e.g. Vassiliadis and Dounias (2009); Cheng et al. (2011).
intruder’s level of significance (see Kim et al., 2007; Wong et al. 2011, and the references
therein). Some studies that demonstrate the potential of AIS technologies in the automatic
monitoring of security breaches are Wightman (2003), Tuo et al. (2004), Gadi et al. (2008) and
Wong et al. (2011).
3.2.4 Anomaly detection technologies
The essence of anomaly detection is to pinpoint unusual deviations from what is thought to be
normal behavior26. The motivation behind the use of anomaly detection techniques in fraud
management is their ability to unmask fraudulent activity without resting on experts to
provide tagged training examples (i.e. in a purely unsupervised-learning mode). However, this
does not strictly apply to all cases of outlier detection techniques, as some require
representative data from one class, typically the class of normal transactions27. The biggest
advantage of novelty detection models is their limited dependence on historical positive
examples, which gives them the capability of detecting new types of malicious activities for
which there exists no prior experience (Ibid).
In the area of unsupervised fraud detection, there exists a plethora of techniques that differ in
morphology, complexity and efficiency. These can vary from simple visualization tools, which
offer an easy and intuitive way to pinpoint outlier transactions, to more advanced data mining
techniques, which perform multidimensional analysis and profiling of “normal” user behaviors
(Bolton and Hand, 2002). Patterns of normality can be drawn along several criteria, depending
on the availability of data and the types of services/goods offered by an e-shop28. Some
indicative examples are given below:
• Average time spent to complete an order.
• Frequency by which the same card is used across different purchases.
• A typical range for the value of the goods purchased by the same customer and/or
using the same card. The spending profile may be further refined so as to take into
account variations across seasons, days of the week, hours of the day, etc.
• Favourite types of goods or services. For instance, a web travel agency may keep a
record of typical journey routes (defined by the airport of origin and destination) for
each customer.
• The behaviour of the peer group (Weston et al., 2008).
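As a minimal illustration of profiling against a "typical range" of purchase values, the sketch below scores a new order by its distance from the customer's historical mean, measured in standard deviations. The purchase history and the threshold of three standard deviations are assumptions chosen for the example; a production profile would usually be refined by season, weekday or hour, as noted in the list above.

```python
from statistics import mean, stdev


def deviation_score(history, new_amount):
    """Distance of a new purchase from the customer's mean, in standard deviations.

    Requires at least two past purchases.
    """
    centre = mean(history)
    spread = stdev(history)
    if spread == 0:
        return 0.0 if new_amount == centre else float("inf")
    return abs(new_amount - centre) / spread


def is_outlier(history, new_amount, threshold=3.0):
    """Flag the purchase if it lies more than `threshold` deviations from the norm."""
    return deviation_score(history, new_amount) > threshold


# Hypothetical purchase values of one customer.
history = [20.0, 25.0, 30.0, 22.0, 28.0, 25.0]
print(is_outlier(history, 27.0), is_outlier(history, 250.0))
```

No fraudulent examples are needed to build such a profile, which is what distinguishes this approach from the supervised techniques of section 3.2.3.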
26 See Bolton and Hand (2001), Hodge and Austin (2004) and Agyemang et al. (2009) for comprehensive
reviews of anomaly detection methodologies and applications.
27 These are what Hodge and Austin (2004) call type-3 or semi-supervised outlier detection
methodologies.
28 See also Thomas et al. (2004), Siddiqi (2006), Delamaire (2009) and Bhattacharyya et al. (2011).
Over the past fifteen years, the scientific literature has seen many successful examples of
outlier detection techniques used in practical online security monitoring. Lee et al. (2013)
employ a version of principal components analysis to identify potentially malicious
connections to a computer network, a problem that shares many features with fraud
identification. Fan et al. (2001) spot abnormal network activity with the aid of rule-based
classifiers which are trained using examples of normal connections. Bolton and Hand (2001)
and Weston et al. (2008) employ an unsupervised statistical technique based on similarity
measures (namely Peer Group Analysis - PGA) to allow the parallel monitoring of a large basket
of credit card accounts and the early detection of suspicious changes in the owners’ spending
profiles. PGA is also adapted by Ferdousi and Maeda (2007) to the detection of suspicious
trading activity in a stock market environment. Statistical profiling for credit card transaction
monitoring in both a supervised and a semi-supervised learning context is also discussed by
Juszczak et al. (2008).
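The idea behind Peer Group Analysis can be conveyed with a heavily simplified sketch: each account is compared with the accounts that behaved most similarly in the past, and it is flagged when its current behaviour departs from that of its peers. The code below is not the procedure of Bolton and Hand (2001) or Weston et al. (2008); the Euclidean similarity measure, the peer group size and the account data are all assumptions made for illustration.

```python
from statistics import mean, stdev


def euclidean(a, b):
    """Distance between two spending histories of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def peer_group(target, history, k):
    """The k accounts whose past spending is most similar to the target's."""
    others = [account for account in history if account != target]
    return sorted(others, key=lambda acc: euclidean(history[acc], history[target]))[:k]


def peer_deviation(target, history, current, k=3):
    """How far the target's current spending lies from its peers (k must be at least 2)."""
    peer_values = [current[peer] for peer in peer_group(target, history, k)]
    gap = abs(current[target] - mean(peer_values))
    spread = stdev(peer_values)
    if spread == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / spread


# Hypothetical weekly spending of five accounts over three past weeks.
history = {
    "A": [100, 110, 105], "B": [102, 108, 107], "C": [98, 112, 103],
    "D": [101, 109, 106], "E": [500, 520, 510],
}
current = {"A": 400, "B": 110, "C": 104, "D": 108, "E": 515}  # current week

peers_of_a = peer_group("A", history, 3)
score_a = peer_deviation("A", history, current)
print(peers_of_a, score_a)
```

Account A is flagged because its spending jumped while its peers stayed stable, whereas the high but habitual spending of account E would not stand out when compared with E's own past.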
Despite the flourishing of statistical paradigms, there are also many research studies that
employ more advanced computational schemes for unsupervised fraud detection. Xu et al.
(2007) present an intelligent algorithm for the monitoring of an online transaction system. This
algorithm induces customized rules for legitimate behavior which are subsequently used to
filter out suspicious activity in a customer’s account. Self-organizing maps are another case of
unsupervised CI techniques that have been appreciated in the detection of credit-card fraud
(see Quah and Sriganesh, 2008; Zaslavsky and Strizhak, 2006; Chen et al., 2006). In the area of
natural computing, there have also been several examples of anomaly detectors for
transaction monitoring. Kim et al. (2003) use an artificial version of the human immune system
to detect insider fraud in the transaction processing system of a retail store. Ozcelik et al.
(2010) employ genetic algorithms to fine-tune the parameters of a bank’s profiling system
used for detecting credit card transactions that deviate from the norm.
3.2.5 Hybrid architectures
A hybrid system can be roughly defined as a smart combination of possibly heterogeneous
components with the aim of delivering superior performance to its building blocks.
Hybridization is typically achieved along two different routes29:
1. The aggregation of homogeneous entities. There are two variations of this scheme. In
non-hierarchical architectures, the overall task is undertaken by a group of equivalent
agents that interact with each other and exchange information. This is e.g. the model
of hybridization adopted by the nature-inspired optimization techniques (genetic
algorithms, particle swarm optimization, etc) discussed in section 3.2.3. In a
hierarchical mode, there exist high-level and base-level modules that perform different
sets of operations. For instance, meta-classifier architectures (Chan and Stolfo, 1993;
Stolfo et al., 1997; Chan et al., 1999; Prodromidis and Stolfo, 1999; Prodromidis et al.,
2000) comprise a group of base classification models, which perform individual
29 See also Tsakonas and Dounias (2002), Hodge and Austin (2004) and Dounias (2014).
learning tasks, and a higher-level classifier, whose job is to aggregate the outputs of
elementary ones. The meta-classifier is also equipped with mechanisms for resolving
possibly conflicting assessments arriving from each classification unit. Stolfo et al.
(1997), Prodromidis and Stolfo (1999) and Prodromidis et al. (2000) present an
ensemble fraud-detection system that collects information from a network of
classifiers monitoring local bank accounts. Experimental results from real credit card
transaction data reveal that smart combinations of learning techniques can have
superior fraud detection performance when compared to standalone tree- or rule-induction
algorithms (ID3, C4.5, RIPPER, etc).
2. The blending of heterogeneous technologies. In this hybridization approach, one seeks
to combine different types of techniques, with documented success in performing a
given task, with the aim of creating a more robust system that is less vulnerable to the
deficiencies of each component. This is effectively a model risk diversification strategy
that can be implemented e.g. by blending supervised with unsupervised learning
techniques or statistical with computational intelligent models. Certainly, more
opportunities for hybridization are given in the context of computational intelligent
paradigms and nature-inspired systems (Tsakonas and Dounias, 2002; Dounias, 2014).
For example, Syeda et al. (2002) propose a parallel architecture that combines
elements of fuzzy logic with neural network technologies to aid the timely discovery
of fraud. Park (2005) employs a genetic algorithm to optimize the parameters of a
neural network-based fraud detector with respect to a complex performance measure
(partial area under the Receiver Operating Characteristic curve) that simultaneously takes into
account the false positive and false negative rates30. Intelligent optimization heuristics
are also adopted by Gadi et al. (2008) and Duman and Ozcelik (2011) to fine-tune
classifiers (a neural network and an artificial immune system) or a pre-existing set of
scoring rules under a misclassification cost criterion. Chen et al. (2006) employ a
hybrid computational scheme that combines self-organizing maps, genetic algorithms
and support vector machines. The genetic algorithm is used to decide upon the
placement of support vectors in proper regions of the solution search-space. Krivko
(2010) presents a transaction monitoring system that mixes behavioural models, for
flagging deviations from normal spending patterns in a group of customer accounts,
with expert rules for subsequently verifying the suspiciousness of each case. Another
possible synergy between intelligent supervised and unsupervised technologies for
fraud detection is put forth by Lei and Ghorbani (2012). Further hybrid fraud detection
schemes are reviewed in Table 1.
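The hierarchical scheme described under the first route, in which a higher-level module aggregates the outputs of base classifiers and resolves their disagreements, can be sketched with a weighted vote. This is a simple stand-in for the learned meta-classifiers of the cited studies, not a reproduction of them; the three base classifiers, the order fields and the labelled examples are hypothetical.

```python
def fit_weights(base_classifiers, labelled):
    """Weight each base classifier by its accuracy on labelled examples."""
    return [sum(classifier(x) == y for x, y in labelled) / len(labelled)
            for classifier in base_classifiers]


def meta_predict(base_classifiers, weights, x):
    """Flag fraud (1) when the weighted votes for fraud exceed half the total weight."""
    fraud_votes = sum(weight for classifier, weight in zip(base_classifiers, weights)
                      if classifier(x) == 1)
    return 1 if fraud_votes > sum(weights) / 2 else 0


# Hypothetical base classifiers, each looking at a single attribute of the order.
base_classifiers = [
    lambda order: int(order["amount"] > 500),
    lambda order: int(order["foreign"]),
    lambda order: int(order["hour"] < 6),
]

# Hypothetical labelled orders used to weight the base classifiers (1 = fraud).
labelled = [
    ({"amount": 900, "foreign": True, "hour": 3}, 1),
    ({"amount": 700, "foreign": False, "hour": 14}, 1),
    ({"amount": 40, "foreign": True, "hour": 13}, 0),
    ({"amount": 60, "foreign": False, "hour": 2}, 0),
]

weights = fit_weights(base_classifiers, labelled)
flagged = meta_predict(base_classifiers, weights, {"amount": 800, "foreign": True, "hour": 15})
cleared = meta_predict(base_classifiers, weights, {"amount": 30, "foreign": True, "hour": 15})
print(weights, flagged, cleared)
```

Weighting by past accuracy lets a reliable base classifier outvote weaker ones, which is one simple way of resolving conflicting assessments.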
3.2.6 Semantic Web technologies and fraud detection
The objective of this sub-section is to review how semantics and semantic web technologies
have been used in the literature to solve the problem of fraud detection. Core capabilities of
30 See subsection 3.5.8 for definitions of these terms.
this technology include the ability to develop and maintain focused but large populated
ontologies, automatic semantic metadata extraction supported by disambiguation techniques,
ability to process heterogeneous information and provide semantic integration combined with
link identification and analysis through rule specification and execution, as well as organization
and domain specific scoring and ranking. These semantic capabilities are coupled with
enterprise software capabilities which are necessary for success of an emerging technology for
meeting the needs of demanding enterprise customers.
Although the SME E-COMPASS project is focused on online fraud detection, this section
presents proposals regarding fraud detection in general, because the aim is to obtain an overview
of how semantics and Semantic Web techniques could be integrated with other
technologies in order to improve an online fraud detection system.
After a review of the current scientific literature, the selected works can be categorized into
those which define ontologies to be used in fraud detection systems, those which use
ontologies for checking user behavior, those which use ontologies for detecting suspicious
transactions and those which use semantic technologies and graph mining (based on
ontologies) for detecting infrequent patterns and abnormalities in credit card use.
Fraud detection and prevention systems are based on various technological paradigms but the
two prevailing approaches are rule-based reasoning and data mining. Ontologies are an
increasingly popular and widely accepted knowledge representation paradigm. Ontologies are
knowledge models that represent a domain and are used to reason about the objects in that
domain and the relations between them (Gruber 1993).
Ontologies can help both of these approaches to become more efficient as far as fraud
detection is concerned. Ontologies have a lot to offer in terms of interoperability, expressivity
and reasoning. The use of ontologies and ontology-related technologies for building
knowledge bases for rule-based systems is considered quite beneficial for two main reasons
(Alexopoulos et al, 2007):
• Ontologies provide an excellent way of capturing and representing domain knowledge,
  mainly due to their expressive power.
• A number of well-established methodologies, languages and tools (Gomez-Perez et al,
  2004) developed in the Ontological Engineering area can make the building of the
  knowledge base easier, more accurate and more efficient, especially in the knowledge
  acquisition stage, which is usually a bottleneck in the whole ontology development
  process.
Ontologies are also very important to the data mining area, as they can be used to select the
best data mining method for a new data set (Tadepalli et al 2004). When new data are
described in terms of the ontology, one can look for the data set which is most similar to the new
one and for which the best data mining method is known; this method is then applied to the
new data set.
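The selection step described above can be sketched as a nearest-neighbour lookup over data set descriptions. The sketch below assumes that each data set is described by a set of ontology terms and uses Jaccard similarity; the descriptor terms, the catalogue entries and the choice of similarity measure are hypothetical and are not taken from Tadepalli et al.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of descriptor terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def recommend_method(new_descriptors, catalogue):
    """Return (method, data set name, similarity) for the most similar known data set.

    `catalogue` maps a data set name to (descriptor terms, best known method)
    and is assumed to be non-empty.
    """
    best_name, best_sim = None, -1.0
    for name, (descriptors, _method) in catalogue.items():
        sim = jaccard(new_descriptors, descriptors)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return catalogue[best_name][1], best_name, best_sim


# Hypothetical catalogue: names, descriptor terms and methods are made up for illustration.
CATALOGUE = {
    "card_transactions": (
        {"tabular", "imbalanced", "labelled", "numeric"}, "decision tree"),
    "web_sessions": (
        {"sequential", "unlabelled", "categorical"}, "clustering"),
}
```

A new data set described as `{"tabular", "imbalanced", "labelled", "categorical"}` shares three of five terms with the first catalogue entry and would therefore inherit its method.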
Alexopoulos et al (2007) propose a methodology for building domain specific ontologies in the
e-government domain. The main characteristic of this methodology is a generic fraud ontology
that serves as a common ontological basis on which the various domain specific fraud ontologies
can be built.
Kingston et al (2003) discuss the status of research on detection and prevention of financial
fraud, and analyze existing legal and financial ontologies in order to identify the
strengths of each in different fields and the way each addresses different aspects of the user
requirements.
Transactions made by fraudsters using counterfeit cards and making cardholder-not-present
purchases can be detected through methods which seek changes in transaction patterns, as
well as checking for particular patterns which are known to be indicative of counterfeiting.
Suspicion scores to detect whether an account has been compromised can be based on
models of individual customers' previous usage patterns, standard expected usage patterns,
particular patterns which are known to be often associated with fraud, and on supervised
models.
Fang et al (2007) propose a novel method, built upon ontology and ontology instance similarity,
for checking user behavior. Ontologies are now widely used to enable knowledge sharing and
reuse, so some personality ontologies can easily be used to represent user behavior. By measuring
the similarity of ontology instances, one can determine whether an account has been defrauded. This
method lowers the data modelling cost and makes the system very adaptive to different applications.
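As a rough illustration of the idea of comparing ontology instances, the sketch below represents an instance as a mapping from property names to values and scores two instances by the weighted share of properties on which they agree. This simple measure, the property names and the threshold are assumptions made for illustration; they are not the similarity measure defined by Fang et al.

```python
def instance_similarity(a, b, weights=None):
    """Weighted share of properties on which two instances agree (0.0 to 1.0)."""
    props = set(a) | set(b)
    weights = weights or {}
    total = sum(weights.get(p, 1.0) for p in props)
    if total == 0:
        return 1.0  # nothing to compare
    agree = sum(weights.get(p, 1.0) for p in props
                if p in a and p in b and a[p] == b[p])
    return agree / total


def looks_defrauded(profile, observed, threshold=0.5, weights=None):
    """Flag the account when observed behaviour is too dissimilar to its stored profile."""
    return instance_similarity(profile, observed, weights) < threshold
```

For example, a stored profile `{"country": "GR", "device": "desktop", "category": "books", "hour": "evening"}` compared with an observation that differs in three of the four properties yields a similarity of 0.25 and would be flagged under the default threshold.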
Rajput et al (2014) address the problem of developing an effective mechanism to detect
suspicious transactions by proposing an ontology-based expert system for suspicious
transaction detection. The ontology consists of domain knowledge and a set of Semantic Web
Rule Language (SWRL) rules that together constitute an expert system. The native reasoning
support of the ontology is used to deduce new knowledge from the predefined rules about
suspicious transactions. The presented expert system has been tested on a real data set of
more than 8 million transactions of a commercial bank. The novelty of the approach lies in the
use of an ontology-driven technique that not only minimizes the data modeling cost but also
makes the expert system extendable and reusable for different applications.
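The way predefined rules can yield new knowledge about a transaction can be sketched with a small forward-chaining loop. This is plain Python rather than SWRL, and the fact names, thresholds and rules below are hypothetical; the sketch only illustrates the inference pattern, not the rule base of Rajput et al.

```python
def forward_chain(facts, rules):
    """Apply rules until no new fact is derived; return the closed fact set.

    Each rule is (premises, conclusion): if all premises hold, the conclusion is added.
    """
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and premises <= facts:
                facts.add(conclusion)
                changed = True
    return facts


def describe(tx, large_amount=10_000):
    """Turn a transaction record into symbolic facts (hypothetical vocabulary)."""
    facts = set()
    if tx["amount"] >= large_amount:
        facts.add("LargeAmount")
    if tx["country"] != tx["home_country"]:
        facts.add("ForeignCountry")
    if tx["account_age_days"] < 30:
        facts.add("NewAccount")
    return facts


# Hypothetical rules: the second one builds on the conclusion of the first.
RULES = [
    (frozenset({"LargeAmount", "ForeignCountry"}), "UnusualTransfer"),
    (frozenset({"UnusualTransfer", "NewAccount"}), "SuspiciousTransaction"),
]
```

Calling `forward_chain(describe(tx), RULES)` and checking for `"SuspiciousTransaction"` in the result gives the final verdict; new rules can be appended to `RULES` without touching the inference loop, which mirrors the extendability argument made above.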
The existence of data silos is considered one of the main barriers to cross-region,
cross-department, and cross-domain data analysis that can detect abnormalities not easily seen
when focusing on single data sources. An evident advantage of leveraging Linked Data and
semantic technologies is the smooth integration of distributed data sets.
The relational database has recognized limitations as a solution basis for scenarios where data
is highly distributed, sizable and where model structures are evolving and de-centralized. New
paradigms in data management, collected under the label “Big Data”, offer alternative
solutions able to process increasing amounts of available data. For fraud detection the
challenge is efficiently pinpointing small anomalies in Big Data. This is often based on patterns
of relationships between data. Essentially, benefit fraud detection is a semantic alignment and
pattern matching problem.
Hu et al (2012) report a case study of applying semantic technologies to social benefit fraud
detection. The authors claim that the design considerations, study outcomes, and lessons learnt
can help in deciding how to adopt semantic technologies in similar contexts.
In a nutshell, by leveraging semantic technology, organizations are able to dynamically
describe new fraud cases and facilitate the integration, analysis, and visualization of disparate
and heterogeneous data from multiple sources. Moreover, by using semantic technology to
generate semantic fraud detection rules, labor-intensive tasks can be converted
into (semi-)automated processes (Hu et al, 2012).
With the recent growth of graph-based data, large-scale graph processing is becoming more
and more important. In order to explore and extract knowledge from such data, graph
mining methods, such as community detection, are a necessity. Although graph mining is a
relatively recent development in the data mining domain, it has been studied extensively in
different areas (biology, social networks, telecommunications and the Internet).
Traditional data mining work has focused on multi-dimensional and text data. However,
emerging industrial needs now call for handling structured, heterogeneous data
instead of traditional multi-dimensional models. This kind of structured data set is naturally
represented as a graph that models a set of objects that can be linked in numerous ways. The
greater expressive power of graphs encourages their use in extremely diverse domains. In
credit card fraud detection, transactions are modelled as a bipartite graph of users and
vendors. Therefore, graph mining algorithms could be used for detecting credit card fraud.
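A minimal sketch of the bipartite representation is given below: transactions are stored as two adjacency maps, one from users to vendors and one from vendors to users. The helper `shared_vendors` is an assumed, deliberately simple example of a query such a graph supports (which vendors are common to a group of cards, for instance cards later reported as compromised); it is not a method taken from the surveyed literature.

```python
from collections import defaultdict


def build_bipartite(transactions):
    """Return (user -> vendors, vendor -> users) adjacency maps.

    `transactions` is an iterable of (user, vendor) pairs.
    """
    by_user, by_vendor = defaultdict(set), defaultdict(set)
    for user, vendor in transactions:
        by_user[user].add(vendor)
        by_vendor[vendor].add(user)
    return by_user, by_vendor


def shared_vendors(by_user, users):
    """Vendors visited by every known user in `users`."""
    vendor_sets = [by_user[u] for u in users if u in by_user]
    if not vendor_sets:
        return set()
    return set.intersection(*vendor_sets)
```

Edges only ever connect a user to a vendor, never two users or two vendors, which is what makes the graph bipartite.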
Skhiri and Jouili (2012) present a survey of recent techniques for graph mining and
study the challenges and possible solutions in this emerging area.
Ramaki et al (2012) present a technique for detecting abnormalities in credit card operations by
exploiting an ontology. Specifically, it uses an ontology graph to model every user’s transaction
behavior and then stores it in the system. During abnormality detection, only those
transactions from the registered transaction history that are highly similar to the incoming
transactions are selected for computation. Detecting abnormal transactions using
ontologies is a very efficient approach, requiring low computational overhead and less
storage for managing credit card transaction data, while data mining is utilized for
abnormality detection.
3.3 Commercial products in place
This subsection briefly presents the software products and tools currently available
on the market for e-merchants. An overview of each product is presented; the content
is sourced from the websites of the respective tool providers (see the relevant
footnotes).
3.3.1 Product: Accertify (an American Express product)
Overview
Constantly evolving card-not-present fraud easily defeats fraud detection products that are
inflexible or use limited data types. These products can lose their effectiveness over time
allowing fraud rates to creep higher, putting the merchant right back where they started.
Accertify’s Fraud Management31 was developed to perform well beyond these limitations
through the advanced, scalable and highly flexible Interceptas Data Management Platform.
At its core, the Interceptas Platform is data-focused, enabling it to effectively and efficiently
make use of vast and disparate enterprise data to more completely and accurately detect
fraud.
Features
• SaaS based platform
• Integrated case management and rules engine
• Advanced Reporting; Ad Hoc and Dashboard capabilities
• Extensive fraud database matching (Risk ID)
• Platform support for local language, currency and time zone
• Simple point-and-click to link transaction elements in review
• Supervisor prioritization and management dashboard
• User friendly rules creation & validation
• Built-in IP geo-location data
• Built-in high risk address/phone look up
• Built-in global post code data
• Built-in BIN information
• PCI-DSS Level 1 Certified
• SSAE 16 Certified Data Center Provider
• ISO/IEC 27001 Certified
• EU Safe Harbor Compliant
• American Express® Risk Management Services
• Integration with Leading Data Services Providers
• Advanced Statistical Models
• Customized models supplement fraud rules and lift screening accuracy
• Accertify Profile Builder32
• Dynamic 360 degree view of each customer’s complete transaction history to optimize
  fraud detection and support new applications
• Customized Report Development
31 http://www.accertify.com/solutions/fraud-management/
3.3.2 Product: Cardinalcommerce
Overview
Cardinal's Consumer Authentication33 technology ties the authentication process to the card
authorization process, where a PIN/password or other unique identifier acts as a ‘digital
signature' that validates cardholder identity in a CNP transaction. Data elements are then
encrypted and transmitted through a PCI/DSS secured environment.
Features
• Flexible and configurable rules engine
• Authentication based on rules
  - Issuer-deemed high-risk customers (or high-risk customers according to the
    issuer)
  - International transactions only
  - High ticket, high fraud product SKUs
• Control the checkout experience for those customers chosen to authenticate
3.3.3 Product: Identitymind
Overview
Identitymind provides a three step anti-fraud evaluation with the patent-pending eDNA
(electronic DNA) technology which recognizes Internet users by their online transactions and
behavior.
32 The Accertify profile builder allows merchants to create views around a customer, a product, an
event or any number of data points. Merchant profile data are collected, securely stored and aggregated
as defined by the merchant and can be used for any number of potential use cases, including: account
takeover, entity monitoring, customer loyalty, e-commerce / Card Not Present fraud, policy compliance
and usage demographics. Merchants benefit from real-time summarization and aggregation capabilities
to help lower the total cost of fraud, especially manual reviews, and turn large volumes of disparate
data into actionable intelligence.
33 http://www.cardinalcommerce.com/
Features
• User identities with payment reputation
• Deep integration with payment network
• Proactive refunding
• Integrated third party services
• Data sharing across Identitymind’s ecosystem
• Real-time protection against systemic fraud
• Chargeback analysis and reports
• Cross payment methods analysis: identities can have multiple types of payments (e.g.
  credit cards, digital wallets, ACH, etc.); the platform tracks the identity’s payment
  behavior across all payment methods.
• Affiliate fraud protection
• Mobile platforms support
• Rule decision engine
• Manual review automation
• IP geolocation
• Device fingerprint
• Extensible API
3.3.4 Product: Iovation
Overview
Iovation TrustScore34 spotlights good customers, even when they are new to the business.
Iovation’s ability to provide a TrustScore is built on rich data from the 9+ billion
transactions analyzed in the Device Reputation Authority, including over 1.7 billion device
histories covering past behavior and any association with fraud or abuse. Applying
predictive analytics to this data allows iovation to deliver a TrustScore for the customers of the
business.
Features
• Business Rules
• Reporting & Analytics
• Geolocation & Real IP
• Mobile Recognition
• Deployment Options
• Real-Time Response
34 https://www.iovation.com/products/trustscore
3.3.5 Product: Kount
Overview
Kount Complete35 provides a single, turnkey fraud solution that is easy-to-implement and
easy-to-use. The all-in-one, Kount Complete platform is designed for businesses operating in
card-not-present environments, simplifying fraud detection and dramatically improving
bottom line profitability.
Features
• Multi-Layer Device Fingerprinting collects a comprehensive set of data that positively
  identifies a device in real time, whether fixed or mobile, without retrieving the user's
  Personally Identifiable Information.
• The Proxy Piercer feature combats fraudsters who use proxy servers to hide their
  actual location. Typically, the location of anyone accessing the Internet can be
  identified via the IP address assigned to their computer by their Internet Service
  Provider.
• The Persona feature is a method of determining key characteristics and identified
  qualities/attributes associated with a transaction in real time.
• The Dynamic Scoring feature monitors a credit card for signs of fraudulent activity
  even after a transaction has been approved. This “post-authorization” process has
  proven highly successful at spotting suspicious activity and retroactively tying that
  activity to previous purchases. The Kount Complete system then alerts the merchant
  that a previously approved order now appears to have relevant connections to
  fraudulent activity. The merchant can re-evaluate the order and decline to ship,
  avoiding the loss of the goods while also preventing the expense of a chargeback.
• The Kount Score feature provides merchants with more predictive control and
  customization in the way they manage their fraud risk.
• The AutoAgent feature is a powerful rules engine that enables administrators and risk
  assessment managers to create custom rules for orders with specific characteristics.
• Business Intelligence Reporting allows the monitoring of overall order traffic through
  the Kount Agent Web Console. Additional reports can be run to ensure the security of
  the application, such as login attempts and configuration setting changes.
• The Agent Workflow Console helps increase operational efficiency and reduces the
  cost of manual reviews. This feature addresses one of the largest fraud prevention
  costs for merchants: the training and maintenance of human risk assessment agents to
  manually review orders. Using a pattern-based rules engine and auto-decision
  routines, the Agent Workflow Console feature enables superior operational
  efficiencies when reviewing transaction activities, evaluating risk, and managing
  human assets.
• Mobile device analysis
• Workflow management is an important factor that ensures the efficient and effective
  processing of orders flagged for review. Based on established rules, Kount’s Workflow
  Queue Manager quickly sends suspect transactions to the most appropriate review
  agent for a convenient and appropriate resolution.
35 http://www.kount.com/products/kount-complete
3.3.6 Product: LexisNexis
Overview
With coverage of more than 10 billion consumer records and 300 million unique businesses,
as well as extensive identities, assets and derogatory histories, LexisNexis Fraud Solutions36
provide relevant information about people, businesses and assets.
Features
• Chargeback Defender uses state-of-the-art identity and address verification tools to
  confirm both billing and shipping information. It also uses advanced IP address
  geolocation software to verify each order's originating city, state, country and continent.
  The robust fraud detection engine in Chargeback Defender evaluates high-risk patterns
  or conditions found during address and identity verification. It resolves false-positive
  AVS failures using a customer's most current address data, and summarizes all results
  in a single three-digit score.
• Instant Authenticate is the next generation of identity authentication, going beyond
  traditional knowledge-based authentication: it uses different capabilities of
  the solution depending on the risk level of the transaction being conducted. The
  options are highly configurable, the approach can be broad or targeted, and specific
  customer demographics can be targeted. The provider can also share industry best
  practices. The result is a quiz that matches
  the level of risk associated with the transaction.
• Multi-Factor Authentication: authenticates a user through multiple factors
• Instant Age Verify: verifies identities and ages
• Retail Fraud Manager: automates workflow and connects fraud detection tools
• Instant Verify: verifies IDs and professional credentials instantly
• TrueID: authenticates a user through fingerprint biometrics
3.3.7 Product: MaxMind
Overview
There are two main product offerings: the GeoIP databases and web services, and the minFraud
Fraud Detection Services.
36 http://www.lexisnexis.com/
Features
MaxMind's GeoIP products enable the identification of the location, organization, connection
speed, and user type of Internet visitors.
The minFraud service37 reduces chargebacks by identifying risky orders to be held for further
review. The minFraud service is used to identify fraud in online e-commerce transactions,
affiliate referrals, surveys, and account logins and signups.
The minFraud service determines the likelihood that a transaction is fraudulent based on many
factors, including whether an online transaction comes from a high risk IP address, high risk
email, high risk device, or anonymizing proxy. One of the key features of the minFraud service
is the minFraud Network, which allows MaxMind to establish the reputations of IP addresses,
emails, and other parameters.
The minFraud Network is made up of over 7,000 e-commerce businesses that use the
minFraud service. Users of the minFraud service benefit from a dynamic and adaptive
approach to fraud detection and the mutual protection of the minFraud Network. Feedback
from merchants serves as a warning signal to all others within the minFraud Network. The
minFraud service can function on its own or as a complement to existing in-house fraud
checking systems.
Key features of the minFraud service include:
• The riskScore (the likelihood that a transaction is fraudulent)
• Geographical IP address location checking
• High risk IP address and email checking
• Proxy detection
• Device tracking
• Bank Identification Number (BIN) to country matching
• The minFraud Network
• Prepaid and gift card identification
• Post query analysis
3.3.8 Product: Subuno
Overview
Subuno38 is a fraud prevention SaaS platform that is easy to implement, easy to use, and built
specifically for small and medium sized businesses. It gives access to over fifteen fraud screening
tools in one centralized system, without having to copy and paste order information across different
systems or set up separate accounts with each provider.
37 http://www.maxmind.com/en/ccv_overview
38 http://www.subuno.com/
Features
• Streamlining Fraud Screening by offering a single cloud platform
• Cloud SaaS: leverage multiple fraud screening tools and solutions using a single
  platform.
• Rules/Decision Engines: process transactions automatically based on business rules
  created by the user.
• Manual Review Portal: review transactions faster by having all the relevant data,
  analysis and triggered rules on the same screen automatically.
• Manual Entry or Automated API: add new transactions manually or use the company’s API to
  automatically send the transactions to Subuno for processing.
• Reporting: obtain a quick snapshot of the business' performance every day through
  daily reports.
3.3.9 Product: Braspag
Overview
Tools for assisting merchants in the risk analysis processes for fighting fraud.
Features
• Velocity – this tool stores the information from credit card transactions and
  cross-checks it against Braspag’s database. Based on the statistics generated, the merchant is
  notified of how many times the same card, IP, full name, ZIP code, e-mail address
  and/or CPF (Brazilian individual taxpayer number) went through a Braspag database or site
  within a certain period of time. The rules for this risk evaluation are established by the
  merchant.
• Warning List – stores positive and/or negative information on the end-customer, such
  as name, e-mail address, CPF, ZIP code, address, and credit card. Braspag’s client
  maintains and consults this database whenever necessary. Upon consulting the
  database, the merchant will know whether the end-consumers have positive or
  negative histories.
• AVS via Acquirers – already integrated into the Braspag platform39, the Address
  Verification System was developed by the credit card acquirers/operators (currently
  only via Redecard and American Express in Brazil) to cross-check information registered on
  the site by the card holder against the billing information on the card used in the
  transaction (information is checked at the card issuer).
• IP Geolocalization – this tool indicates where geographically the end-consumer is
  making the transaction. The merchant can cross-check this information against other
  registration or product delivery information and decide whether the transaction is a fraud risk
  or not.
• Integration with services provided by partners – Braspag is integrated with
  neural-network technology to assist risk management and the combating of fraud.
39 http://www.braspag.com/
3.3.10 Product: Fraud.net
Overview
Fraud.net40 is a repository of data crowdsourced from online retailers regarding fraudulent
records. The data are provided by other online retailers who are combating fraud on a daily
basis. The goal is to provide online retailers with better tools for fraud prevention. To date,
Fraud.net has pooled over 4 billion data points in its repository. Online retailers can check
individual orders with Fraud.net to see if other retailers have experienced problems with that
potential customer. Then, online retailers can decide to not ship orders to that customer or
conduct further research to verify the customer's authenticity.
Features
• Contributing Data: With a verified account, users can submit information into the
  Fraud.net data repository. Data can be submitted in a variety of formats including
  online forms, .xls/.csv/.xml files, as well as via web services. Submission of data on
  fraudsters will help other retailers avoid shipping to those individuals. Fraud.net gives
  merchants the chance to report fraudsters who have abused the Card-Not-Present
  purchasing environment.
3.3.11 Product: Volance
Overview
MERCHANTGUARD SUITE: An automated platform designed to integrate with any online
commerce shopping cart or order form system and web site to help identify and prevent
identity theft and credit card fraud using six specially made modules. Designed to work for
businesses of all sizes, MerchantGuard41 features multiple integration techniques and an
extensive API for remote development and integration.
Features
• User Data Validation
• TrueIP Detection
• Computer History Reports
• Velocity Detection
• Web Hosting Module
• Social Network Validation
• Proxy server lists
• Known Fraud IP address lists
• Known Fraud E-mail address lists
• Zombie/hacked computer lists
40 http://www.fraud.net/
41 http://www.volance.com/small_business.php
3.3.12 Product: Authorize.net by Cybersource.com (a Visa company)
Overview
Identification, management and prevention of suspicious and potentially costly fraudulent
transactions with the Advanced Fraud Detection Suite (AFDS) product. The product can be
customized through its rules-based filters to match different types of business.
Features
AFDS includes multiple filters and tools that work together to evaluate transactions for
indicators of fraud. Their combined logic provides a powerful and highly effective defence
against fraudulent transactions. In addition to the filters listed below, Authorize.Net42 also
offers a new Daily Velocity Filter at no charge. The Daily Velocity Filter allows the user to
specify a threshold for the number of transactions allowed per day, a useful tactic to identify
high-volume fraud attacks.
• Amount Filter – Sets lower and upper transaction amount thresholds to
  restrict high-risk transactions often used to test the validity of credit card
  numbers.
• Hourly Velocity Filter – Limits the total number of transactions received per
  hour, preventing high-volume attacks common with fraudulent transactions.
• Shipping-Billing Mismatch Filter – Identifies high-risk transactions with
  different shipping and billing addresses, potentially indicating purchases made
  using a stolen credit card.
• Transaction IP Velocity Filter – Isolates suspicious activity from a single source
  by identifying excessive transactions received from the same IP address.
• Suspicious Transaction Filter – Reviews highly suspicious transactions using
  proprietary criteria identified by Authorize.Net's dedicated Fraud
  Management Team.
• Authorized AIM IP Addresses – Allows merchants submitting Advanced
  Integration Method (AIM) transactions to designate specific server IP
  addresses that are authorized to submit transactions.
• IP Address Blocking – Blocks transactions from IP addresses known to be used
  for fraudulent activity.
• Enhanced AVS Handling Filter – The Address Verification Service (AVS) is a
  standard feature of the payment gateway that compares the address
  submitted with an order to the address on file with the customer's credit card
  issuer. Merchants can then choose to reject or allow transactions based on the
  AVS response codes. AFDS includes a new AVS filter that assists the decision
  process by allowing merchants the additional options of flagging AVS
  transactions for monitoring purposes, or holding them for manual review.
• Enhanced CCV Handling Filter – Like AVS, Card Code Verification (CCV) is a
  standard feature of the payment gateway. CCV uses a card's three- or four-digit
  number to validate customer information on file with the credit card
  association. Like the AVS Filter, the CCV Filter allows merchants the additional
  options of flagging CCV transactions for monitoring purposes, or holding them
  for manual review.
• Shipping Address Verification Filter – Verifies that the shipping address
  received with an order is a valid postal address.
• IP Shipping Address Mismatch Filter – Compares the shipping address provided
  with an order to the IP address from which the order originated. This helps
  to determine whether or not the order is shipping to the country from which it
  originated.
• Regional IP Address Filter – Flags orders coming from specific regions or
  countries.
42 http://www.authorize.net/
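The velocity and amount filters listed above follow a simple pattern that can be sketched generically. The class and function names below are illustrative assumptions and do not reflect the Authorize.Net API or its actual filter logic; the sketch counts transactions per key (for example an IP address) in a sliding time window and assumes timestamps arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class VelocityFilter:
    """Flag a key (e.g. an IP address) that exceeds `limit` transactions per window."""

    def __init__(self, limit, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self._seen = defaultdict(deque)  # key -> timestamps inside the window

    def check(self, key, timestamp):
        """Record a transaction; return True if it should be flagged."""
        times = self._seen[key]
        # Drop timestamps that have fallen out of the sliding window.
        while times and timestamp - times[0] >= self.window:
            times.popleft()
        times.append(timestamp)
        return len(times) > self.limit


def amount_out_of_range(amount, lower, upper):
    """Amount filter: True when the amount falls outside [lower, upper]."""
    return amount < lower or amount > upper
```

With `VelocityFilter(limit=2)`, the third transaction from the same IP address within one hour is flagged, while a transaction arriving more than an hour later starts from a clean window. Using a single shared key (for instance a constant string) turns the same class into an overall hourly filter.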
3.3.13 Product: 41st Parameter
Overview
The majority of 41st Parameter’s43 clients are complex, multinational corporations. Its
approach to fraud detection for merchants contributes materially to operating
plans by not only slashing fraud-related chargebacks, but also simultaneously reducing
operational expenses by an average of 35% and reducing revenue leakage by eliminating the
auto-rejection of transactions, an industry-wide problem that cancels an average of 7% of all
sales, many in error.
Features
The main Global Fraud Management Portal and its Decision Manager offering provide
capabilities to automate and streamline fraud management operations, including the ability to
leverage the fraud detection radar. They provide more data about the inbound order, as well as
comparisons to data generated from the over 60 billion transactions that Visa and CyberSource
process annually, including truth data.
Some of the services are listed below:
• Device fingerprinting with packet signature inspection
• IP Geolocation
• Velocity monitoring
• Multi-merchant transaction histories/shared data
• Neural net risk detection
• Positive/negative/review lists
• Global telephone number validation
• Global delivery address verification services
• Standard card brand services (AVS, CVN)
• Custom fields for user data
• Powerful Business User Rule Management Interface
• Creates and modifies rules on-demand
• Creates multiple screening profiles tailored to the business and products
• Passive mode allows rules to be tested before going "live"
• Flexible Case Management System: CyberSource Intelligent Review
  Technology
• Consolidated review data to streamline order review
• Customizable case management layouts and search parameters
• Automated case ownership and priority assignment
• Automated queue SLA management
• Semi-automatic callouts to third-party validation services
• Advanced process analytics and reporting
• Optional export of data to the case system via the API/XML interface
43 http://www.the41.com/
3.3.14 Product: ThreatMetrix
Overview
With competitors only a click away, e-commerce sites have to balance fraud prevention with
keeping the online purchase experience as simple as possible. ThreatMetrix offers
context-sensitive fraud prevention that helps e-commerce merchants manage risk in real time.
The TrustDefender Cybercrime Protection Platform [44] provides comprehensive, context-based
authentication, protecting mission-critical enterprise applications from hackers and fraudsters.
ThreatMetrix has created a comprehensive process to create trust across all types of online
transactions, guarding against account takeover, card-not-present, and fictitious account
registration frauds.
Features
• Profile Devices
• Identify Threats
• Examine Users’ Identities and Behavior
• Configure Business Rules
• Validate Business Policy
• Enable Detailed Analysis
• ThreatMetrix Global Trust Intelligence Network
[44] http://www.threatmetrix.com/platform/trustdefender-cybercrime-protection-platform/
3.3.15 Product: Digitalresolve
Overview
Fraud Analyst is a proven platform [45] for risk-based authentication, fraud detection, and real-time identity verification that is helping to reduce online fraud by as much as 90 percent for all
customer segments.
Regardless of the online touchpoint - from logins to new account creation to online
transactions - Fraud Analyst provides protection for every customer segment and user session,
offering a layered approach to fraud detection and prevention that helps to secure online
accounts from today's most advanced criminals.
Features
• Transaction Monitoring: Fraud Analyst leverages a powerful transaction
analysis engine that monitors every online interaction and transaction and
provides flexible response mechanisms that allow the organization to address
incidents based on its business, risk and operational policies. By tracking all
user activity in real time, Fraud Analyst provides seamless, individualized
protection for every user based on their unique behavior, bringing perspective
to events that may seem uninteresting in isolation, or that may appear to be
fraudulent at first glance but are perfectly legitimate for a particular online
user.
• Login Authentication: Fraud Analyst provides transparent login authentication
that offers strong protection while maintaining the normal customer
experience. This risk-based approach to authentication is helping to reduce
online fraud by as much as 90% by spotting anomalies in the way in which
users normally access their accounts, and by offering further authentication in
real time should a login meet pre-defined risk thresholds.
• Identity Verification: Fraud Analyst automates, expedites and secures the
online account opening and registration processes. By marrying elements of
the physical world to dynamic information about applicants in the online
world, Fraud Analyst prevents application and enrolment fraud in real time, adding another dimension to traditional identity verification checks.
• Research and Reporting Tools: At the core of Fraud Analyst is a powerful risk
analysis engine that offers unparalleled insight and actionable information for
all online touchpoints and user sessions, allowing organizations to take a
proactive role in fraud prevention. Fraud Analyst comes standard with
advanced out-of-the-box and customer-driven research, risk analysis and
reporting tools that identify fraud within the online channel at both the
individual and enterprise level, allowing an organization to spot emerging
fraud patterns and take a deep dive into specific fraud incidents.
[45] http://www.digitalresolve.com/
3.3.16 Product: Nudatasecurity
Overview
NuDetect [46] is a comprehensive behavior analytics platform that identifies and confronts
criminals with early user profiling and threat-appropriate countermeasures. NuDetect
highlights their intent before they have a chance to penetrate a web site and do damage.
Features
• Mobile Optimised: Whether deployed via native app or web site, NuDetect
uses mobile optimised event sensors for maximum acuity and security across
mobile apps and services.
• Real-time Detection and Mitigation: The system monitors activity in real time,
allowing action to be taken against fraud as it happens.
• Situational Context: Customized sensors which are specific to the business’s
unique security requirements.
• Historical Context Awareness: NuDetect uses historical cross-session and
cross-cloud behavior patterns, stored in the NuData cloud. This gives
incredible accuracy and safety from day one.
• Adaptive Counter Measures: Suspicious actors are challenged with threat-appropriate
countermeasures designed not only to impede or stop a suspect
but also to give further intelligence on the nature of the suspect.
• Decrease Customer Abandonment: Comprehensive profiles ensure that no
unnecessary countermeasures are deployed against hard-earned users.
• Machine Learning: Creates positive and negative behavior patterns which are
automatically adapted in real time. Stored in the NuData cloud, the web
service will benefit from thousands of intelligence profiles.
• Trigger Alerts and Countermeasures: Control of the levels of alerts and what
those alerts can trigger.
• Actionable Intelligence: NuDetect attributes a unique score to every user
interaction. A customised risk model provides actionable intelligence.
• SaaS
3.3.17 Product: Easysol
Overview
DetectTA [47] is a fraud prevention solution that qualifies a transaction’s risk in real time based
on a heuristic profile of the user’s behavior that the product learns over time. DetectTA
ensures that user accounts are protected, because no matter how it’s being done or what
malware is being used, the differences from normal user activity can still be detected across all
banking channels.
[46] http://nudatasecurity.com/nudetect/
[47] http://www.easysol.net/newweb/Products/Detect-TA
Features
• Real-Time Risk Qualification
• Cross-Channel Support
• Completely Integrated Case Management and Reporting
• FFIEC and Regulatory Compliance
• Suspicious Activity Analyzers
• Personalized Interactive Dashboard
• Risk-Based Authentication when Combined with DetectID
• Customizable Rules
The following matrix gives an initial positioning of the products reviewed above in terms of
their functionality.
Table 2: Functionality comparison table of anti-fraud commercial products
[The cell marks of the comparison matrix could not be recovered from the source layout; only its headings are listed.]
Functionalities: Machine Learning; Score; Proxy; PCI-DSS/SSAE 16/ISO/IEC 27001; Manual review enhancements; IP; Affiliate protection; Geo-location; API extensibility/web services; Database/networks; Device profile/fingerprint; Reporting; Check for Risk; Rules; Address Verification; SaaS.
Anti-fraud products: Accertify; Cardinalcommerce; Identitymind; Kount; Lexisnexis; Maxmind; Braspag; Volance; The 41st Parameter; Threatmetrix; Easysol; Digitalresolve; Nudatasecurity; Fraud.net; Authorize.net; Iovation; Subuno.
3.4 Research project results
In this sub-section, European research projects related to fraud detection are presented. These
projects were carried out within the Research Framework Programmes of
the European Commission.
The project search engine at http://cordis.europa.eu/projects/home_en.html was used
to identify such projects. Six projects were found, all of them related to fraud detection
to a greater or lesser extent. However, only one of them is directly related to
online fraud detection. Furthermore, none of them is related to credit card fraud detection.
Most of the projects were carried out during the 1990s, and it was not possible
to find a web site describing the project results and/or the project deliverables. More recent
projects are those which explore the use of ontologies and semantic technologies for fraud
detection. These projects were carried out at the beginning of the 2000s. The most recent project
finished in 2011.
The first project proposed techniques for monitoring online user transactions and detecting
fraudulent behavior. The second developed statistical techniques for extracting knowledge
from large databases which could be used in fraud detection systems. The third project
developed a parallel data-mining server which could also be used in fraud detection systems.
The fourth project, which is the most recent one, developed a highly scalable middleware
platform able to process massive data streams in real time, such as credit card transaction data.
1. Customer On Line Behaviour Analysis. Start date June 1996. Duration 15 months. This
project was funded by the Fourth Framework Programme (FP4) “Information & Communication Technologies”. The
objective of the project was the development of a core European technological offering in
the emerging sector of Customer On Line Behavioural Data Analysis. The need for high
performance fraud detection applications based on this core technology has evolved, with
different levels of maturity, in different markets. The project proposed that the
combination of HPCN technologies and advanced pattern recognition techniques can
provide a suitable solution to the need of monitoring these transactions and detecting
fraudulent behaviours in an on-line environment. The focus was on the requirements of
the credit card market, where a significant market for on-line fraud detection was mature.
A neural based fraud detection software prototype was developed and a performance
characterisation activity, on SMP architectures, was executed with specific reference to
credit card fraud detection requirements.
2. Data analysis & Risk analysis in support of anti-fraud policy. This project developed and
implemented methods and techniques in data analysis, risk analysis and statistical data
mining on both dedicated data bases of reported cases of irregularities and frauds and
publicly available databases with a view to the estimation of fraud, detection of patterns
and trends and assessment of data quality. This project contributed to the protection of
the financial interests by applying and developing statistical techniques for the efficient
and objective extraction of knowledge from relevant and large databases. Results of the
project extend and enrich the range of proactive approaches to fraud control.
3. Data Mining File Server. Start date December 1995. Duration 36 months. This project was
funded by the Fourth Framework Programme (FP4) “Information & Communication Technologies”. The project aimed to
enhance the performance and functionality of data-mining systems by building a special-purpose
parallel data-mining server and an associated front-end client. This was to be
achieved by:
a. Building a parallel data-mining client server product with scalable high
performance;
b. Adding value by improved functionality and cost performance;
c. Satisfying the data-mining needs of the data-dependent industries.
The main technical innovation was the implementation of current and emergent data-mining technology and associated database techniques on a CPU-intensive server running
on the Parsys parallel platform. Large volumes of data, too great for analysis, have been a
major problem for end users. The project sought to make it possible to search
and analyse these very large databases in order to find information important to the
competitiveness of many organisations. Results of the project could be applied in fraud
detection systems.
4. STREAM: Scalable autonomic streaming middleware for real-time processing of massive
data flows. From 2008-02-01 to 2011-04-30. This project was funded under the EU Seventh Framework
Programme (FP7-216181) and aimed at producing a highly scalable middleware platform able
to process in real time massive data streams such as the IP traffic of an organization, the
output of a large sensor network, the e-mail processed by an ISP, the market feeds from
stock exchanges and financial markets, the calls in a telco operator, credit card payments,
etc. This was intended to enable a myriad of new services and applications in the upcoming Internet of
Services. A few example applications which require the ability to analyze massive
amounts of streaming data in real time are: stock market data processing, anti-spam and
anti-virus filters for e-mail, network security systems for incoming IP traffic in organization-wide networks, automatic trading, fraud detection for cellular telephony to analyze and
correlate phone calls, fraud detection for credit cards, and e-services for verifying
compliance with service level agreements. Results of the project were applied to process a huge
number of credit card transactions.
Finally, two projects are focused on the use of ontologies for fraud detection.
5. FF-POIROT: Financial Fraud Prevention-Oriented Information Resources using Ontology
Technology. This was an Information Society Technologies (IST) project funded under the
EU Fifth Framework Programme (IST-2001-38248). The project explored the use of ontology
technologies in the field of financial fraud prevention and detection, as explained by Dr Gang
Zhao from the Free University of Brussels’ STARLab. This facilitated intelligent data
processing and knowledge management from structured information in databases and
unstructured data from web pages. According to Dr Zhao, the project focused on fraud detection and
prevention scenarios such as detecting illegal online solicitation of financial investment
and checking against VAT carousel fraud, i.e. the circular trade of cross-border purchases
between connected companies.
6. IWEBCARE: An Ontological approach for fraud detection in the health care domain. The
European iWebCare project (FP6-2004-IST-4-028055) aimed at designing and developing a
flexible fraud detection web services platform, which was able to serve e-government
processes of fraud detection and prevention, in order to ensure quality and accuracy and
minimize loss of health care funds. The approach this project adopted involved the
introduction of a fraud detection methodology combining business process modelling and
knowledge engineering as well as the development of an integrated fraud detection
platform combining an ontology-based rule engine and a self-learning module based on
data mining.
3.5 Weaknesses and limitations of current practices compared to SME needs
3.5.1 Introduction
The purpose of this section is to summarize the findings of the literature survey and expose
the weaknesses and limitations of fraud detection technologies and practices already in place.
The discussion pays particular attention to the special features of the application domain and the
business environment faced by small and medium-sized (SM) online merchants. Some of the challenges associated with
modern anti-fraud technologies are also highlighted by Fawcett et al. (1998), Axelsson (2000)
and Behdad et al. (2012).
3.5.2 Lack of adaptivity
Fraud detection is a highly dynamic and non-stationary learning problem. Every day, new types
of fraud and malicious activities make their appearance in response to stricter security policies
(Sahin and Duman, 2011). In addition, legitimate customer behaviors change with the
succession of seasons or economic cycles. All these factors contribute to a continuously
changing learning environment which quickly outdates existing knowledge about fraud
detection. Non-stationarity typically plagues fraud monitoring systems that operate in a
supervised-learning mode, as these purely rest on historical data to extract generalized
prototypes of legal/illegal behavior. However, it is also a problem for outlier detection
techniques. Imagine an algorithm that detects anomalous transactions by simply observing
deviations from the typical purchasing behavior of “good” customers. Such an algorithm is doomed to
perform poorly if spending profiles exhibit seasonal variations or change completely with a
downward swing of the economy. The first case may be easy to address by creating conditional
rules of legitimate behavior depending on seasonal trading levels [48]. In the second case, the
solution is not so trivial, simply because there might exist no prior experience of
the oncoming market conditions or it might be difficult to get early warnings of business
cycles.
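The idea of conditional rules of legitimate behavior can be illustrated with a minimal Python sketch. It assumes a single transaction attribute (the order amount), a hypothetical season label attached to each order and a simple z-score rule; none of these choices come from the surveyed literature, they only make the argument concrete. The sketch also shows the second case discussed above: for a condition never seen before, no profile exists and the detector cannot give a verdict.

```python
from collections import defaultdict
from statistics import mean, stdev

def fit_seasonal_profiles(history):
    """Build one (mean, standard deviation) profile per season.

    history: iterable of (season, amount) pairs for orders known to be legitimate.
    """
    by_season = defaultdict(list)
    for season, amount in history:
        by_season[season].append(amount)
    # A season needs at least two observations for a sample standard deviation.
    return {s: (mean(v), stdev(v)) for s, v in by_season.items() if len(v) >= 2}

def is_anomalous(profiles, season, amount, z_threshold=3.0):
    """Flag an order whose amount deviates strongly from its season's profile."""
    if season not in profiles:
        return None  # no prior experience for this condition: no verdict possible
    mu, sigma = profiles[season]
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > z_threshold

# Hypothetical data: spending is much higher in the holiday season.
history = [("regular", a) for a in (40, 45, 50, 55, 60)] + \
          [("holiday", a) for a in (180, 190, 200, 210, 220)]
profiles = fit_seasonal_profiles(history)

holiday_verdict = is_anomalous(profiles, "holiday", 200)      # typical for the season
regular_verdict = is_anomalous(profiles, "regular", 200)      # far above the regular profile
recession_verdict = is_anomalous(profiles, "recession", 200)  # unseen market condition
```

The same order amount is accepted in one season and flagged in another, while the unseen condition returns no verdict at all, which is the situation described for a sudden change in the economy.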
Alternative terms that data-mining experts use to describe the problem of non-stationarity are
population drift (Hand, 2007) and concept drift (Behdad et al. 2012). These concepts stress
the fact that, apart from the introduction of new types of fraud, various other aspects of the
learning problem may change from time to time. For instance, technology innovations may
enable the monitoring of new transaction attributes, the prevailing economic conditions may
change the relative frequency of fraudulent and authentic orders, or the
online shop may decide to launch new types of products/services (see also Abbass et al.,
2004). A particular case of concept drift that afflicts supervised learning is when fraudsters
alter their behavior to resemble the typical usage profiles of an e-shop’s web site. This
adaptation is part of the inherent competition between perpetrators and security managers,
the so-called “arms race” (Hand, 2007; Behdad et al. 2012). Obviously, in this case, the stored
signatures of normal/fraudulent activity are no longer valid and need to be updated to meet
current conditions. However, deciding when exactly to initiate this update process might be an
issue (see also Behdad et al., 2012).
Whatever the source of non-stationarity, it has important implications for the principles
governing the design of future transaction monitoring systems. In particular, the success of an
automatic fraud detector (FD) should lie in its ability to effectively respond to a changing
environment. Currently available FD technologies lack this kind of self-adaptiveness, as they
assume a great deal of human involvement in the preparation and labelling of
training/validation datasets. This reduces the hopes for developing an autonomous FD system
that is solely based on expert rules or supervised learning techniques (see also Xu et al., 2007).
With anomaly detectors similar problems arise, as many state-of-the-art systems effectively
utilize prototypes of normality to isolate fraudulent cases. These normality prototypes are
likewise extracted from historical data.
The most common remedies proposed in the literature against the problem of non-stationarity are
to re-train the fraud detector at periodic or irregular intervals (Burge and Shawe-Taylor, 1997;
Bolton and Hand, 2002) or to employ an autonomous learner that is able to self-organize. In
fact, several nature-inspired classifiers, such as the artificial immune system, possess this kind
of property (see e.g. Wong et al., 2011). However, these systems are still under development
and their full potential has not yet been realized.
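Periodic re-training can be sketched as follows in Python. This is an illustrative toy, not a method taken from the cited studies: the class name, the sliding window of confirmed legitimate order amounts, the re-training interval and the z-score rule are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, pstdev

class PeriodicallyRetrainedDetector:
    """Toy anomaly detector that re-estimates its normality profile at fixed intervals."""

    def __init__(self, window_size=20, retrain_every=10, z_threshold=3.0):
        self.window = deque(maxlen=window_size)  # only the most recent legitimate orders
        self.retrain_every = retrain_every
        self.z_threshold = z_threshold
        self.pending = 0       # confirmed orders received since the last re-training
        self.mu = None
        self.sigma = None

    def add_legitimate(self, amount):
        """Record an order confirmed as legitimate; re-train when enough have arrived."""
        self.window.append(amount)
        self.pending += 1
        if self.pending >= self.retrain_every:
            self.retrain()

    def retrain(self):
        self.mu = mean(self.window)
        self.sigma = pstdev(self.window)
        self.pending = 0

    def is_suspicious(self, amount):
        if self.mu is None:
            return False       # not trained yet
        if self.sigma == 0:
            return amount != self.mu
        return abs(amount - self.mu) / self.sigma > self.z_threshold

detector = PeriodicallyRetrainedDetector()
for a in [48, 52, 50, 49, 51, 47, 53, 50, 52, 48] * 2:
    detector.add_legitimate(a)
before_drift = detector.is_suspicious(80)   # unusual under the old spending profile

# Legitimate spending drifts upwards; old observations leave the window.
for a in [78, 82, 80, 79, 81, 77, 83, 80, 82, 78] * 2:
    detector.add_legitimate(a)
after_drift = detector.is_suspicious(80)    # now typical
```

Note that the sketch still depends on someone confirming which orders were legitimate, which is exactly the human involvement that limits self-adaptiveness in the discussion above.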
[48] Provided, of course, that one has a large set of transaction data with sufficient representation across
seasons.
3.5.3 Lack of publicly available data/ joint actions
One of the major obstacles to the large-scale deployment of online security systems is the lack
of publicly available data for R&D activities. Very few companies/organizations are willing to
share real customer transaction data, either because of security, privacy or competitiveness
issues (Gadi et al., 2008; Srivastana et al. 2008; Sahin and Duman, 2011). Scientific research in
the area of fraud detection is typically performed in a highly controlled environment, with
strict terms about the disclosure of experimental details and the dissemination of findings. For
this reason, most of the published studies use “camouflaged” data sets with encrypted
attributes and, occasionally, hide several aspects of the experimental design. This “secrecy”
surrounding R&D developments in fraud detection makes it difficult to make fair comparisons
across different technologies, to boost the understanding of fraud through the exchange of
practices/knowledge and also to commercially exploit the findings of data mining models
(Bolton and Hand, 2002; Phua et al., 2005; Sahin and Duman, 2011; Wong et al., 2011; Ngai,
2011). Meta-learning architectures offer a promising solution to this problem, as they
effectively distribute the overall fraud recognition task among independent agents that
operate in the isolated environments of possibly competitive organizations (Stolfo et al., 1997;
Prodromidis et al., 2000). Still, these architectures have not been embraced by
practitioners and researchers to the extent required to become state-of-the-art commercial
solutions for SM merchants. Therefore, there is still much room for progress in the direction of
collaborative anti-fraud actions.
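The principle behind such architectures can be sketched in Python as follows. The example is a deliberate simplification and not the method of the cited authors: each hypothetical organisation trains a one-feature threshold rule on data that never leaves its premises, and only the resulting predictors are combined, here by an accuracy-weighted vote on a small shared validation set. All data values and function names are invented for illustration.

```python
from statistics import mean

def train_local_detector(transactions):
    """Learn a threshold rule from one organisation's private data.

    transactions: list of (amount, is_fraud) pairs kept inside the organisation.
    """
    fraud = [a for a, y in transactions if y]
    legit = [a for a, y in transactions if not y]
    threshold = (mean(fraud) + mean(legit)) / 2
    fraud_is_high = mean(fraud) > mean(legit)

    def predict(amount):
        return (amount > threshold) == fraud_is_high
    return predict

def train_combiner(detectors, validation):
    """Weight each base detector by its accuracy on a shared validation set."""
    weights = []
    for predict in detectors:
        correct = sum(predict(a) == y for a, y in validation)
        weights.append(correct / len(validation))

    def combined(amount):
        # Weighted vote: flag as fraud if more than half of the total weight says so.
        vote_fraud = sum(w for w, p in zip(weights, detectors) if p(amount))
        return vote_fraud > sum(weights) / 2
    return combined

# Hypothetical private data sets of three merchants; only predictors are shared.
site_a = [(20, False), (30, False), (40, False), (300, True), (400, True)]
site_b = [(25, False), (35, False), (500, True), (700, True)]
site_c = [(10, False), (50, False), (90, False), (250, True)]
detectors = [train_local_detector(s) for s in (site_a, site_b, site_c)]

validation = [(15, False), (60, False), (280, True), (450, True)]
combined = train_combiner(detectors, validation)

large_order = combined(280)
small_order = combined(45)
```

In this toy setting the second merchant's rule alone would miss the order of 280, but the combined vote flags it, without any merchant having disclosed raw transactions.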
3.5.4 Scalability issues
Extracting fraudulent patterns typically entails processing huge volumes of transactions
described by tens or hundreds of attributes. Most of the transaction parameters are possibly
irrelevant, in the sense that they contain information of little use for the fraud analyst (Hand,
2007). The inherent size and dimensionality of the problem severely slows down the learning
rate of common (semi-) supervised FD schemes and hence diminishes their ability to respond
adequately and timely to intrusion attempts [49]. Distributed learning architectures seem to be
less susceptible to scalability problems, as they split the overall learning task between several
agents each of which is presented with a portion of the total transactions data (see Stolfo et
al., 1997; Chan et al., 1999; Prodromidis et al., 2000).
A fundamental yet largely unsolved problem in fraud detection is how to narrow down the
search for fraudulent patterns to spaces of manageable dimensionality (dimensionality
reduction). This is equivalent to isolating those attributes from each transaction that can
describe fraudulent activity in the most compact and efficient way. Dimensionality reduction is
also related to the feature selection problem in data mining - or variable significance testing
in statistics - and it is an area of active research. Feature selection is generally considered as a
computationally intensive problem severely hindered by combinatorial explosion, nonlinear
cross-dependencies among problem variables and data redundancy. This makes it difficult to
deal with using standard commercially-available tools. Nature-inspired (NI) optimization
methods often present an attractive alternative for large-dimensionality solution spaces. In
fact, NI heuristics, such as genetic algorithms, have been used in combination with
classification or rule-based systems to derive optimal values for model parameters, rule
weights, acceptance/rejection thresholds, etc. (see e.g. Park, 2005; Gadi et al., 2008; Duman
and Ozcelik, 2011).
[49] This particularly applies to certain types of highly-parameterized models, such as artificial neural
networks, although currently there exist procedures for parallelizing and thus speeding up their learning
process (see e.g. Syeda et al., 2002).
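How a genetic algorithm might search the space of attribute subsets can be sketched in Python as below. This is an illustrative wrapper-style example and not a reconstruction of any cited system: the nearest-centroid classifier, the per-feature penalty, the population settings and the synthetic data (one informative attribute plus four noise attributes) are all assumptions.

```python
import random

def nearest_centroid_accuracy(rows, labels, mask):
    """Training accuracy of a nearest-centroid rule restricted to the selected columns."""
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0

    def centroid(cls):
        members = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(r[c] for r in members) / len(members) for c in cols]

    c0, c1 = centroid(0), centroid(1)

    def dist2(r, centre):
        return sum((r[c] - m) ** 2 for c, m in zip(cols, centre))

    hits = sum((dist2(r, c1) < dist2(r, c0)) == bool(y) for r, y in zip(rows, labels))
    return hits / len(rows)

def fitness(rows, labels, mask, penalty=0.01):
    """Reward accuracy, penalise every additional attribute."""
    return nearest_centroid_accuracy(rows, labels, mask) - penalty * sum(mask)

def select_features(rows, labels, generations=30, pop_size=20, mutation=0.1, seed=0):
    rng = random.Random(seed)
    n = len(rows[0])
    # Start from the full attribute set plus random subsets (bit masks).
    population = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                              for _ in range(pop_size - 1)]
    score = lambda m: fitness(rows, labels, m)
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        elite = population[: pop_size // 2]      # elitism: best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation else g for g in child]  # bit flips
            children.append(child)
        population = elite + children
    return max(population, key=score)

# Synthetic data: attribute 0 separates the classes, attributes 1-4 are noise.
data_rng = random.Random(1)
labels = [0] * 30 + [1] * 30
rows = [[(5.0 if y else 0.0) + data_rng.gauss(0, 1)] +
        [data_rng.gauss(0, 1) for _ in range(4)] for y in labels]
best_mask = select_features(rows, labels)
```

Because the best half of each generation is carried over unchanged, the returned mask is never worse, under this fitness function, than using all attributes; whether it is the global optimum is not guaranteed.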
3.5.5 Limitations in integrating heterogeneous data and information
sources
Nowadays, with the expansion of mobile internet, geo-informatics and social networking
services, new opportunities have arisen for fraud investigators. For instance, it is possible to
get more insight into each case by performing a geo-analysis of various transaction parameters
(IP, contact or shipping address), supplementing customer profiling with information from
social network accounts or making associations with other fraudulent cases through e-community analysis techniques [50]. No matter how promising these developments might seem,
they are still difficult to incorporate into current SM anti-fraud technologies and practices.
The cross-investigation process described above often requires integrating information from
various globally-dispersed sources (bank databases, social networks, geo-analytical services)
that typically comes in a variety of forms (numerical, symbolic, images, etc.) (Hand, 2007) [51].
How to automate this process remains a challenge, despite the great deal of research work in
this direction over the past thirty years [52].
3.5.6 Dealing with case imbalance and skewed class distributions
Fraud detection is a learning problem where the case of interest (fraud) makes up a tiny
portion of the total volume of data (transactions). A typical ratio of normal to fraudulent cases
can be as large as 10,000 to 1 (Sahin and Duman, 2011). This is the well-known case imbalance
or skewed class distribution problem, discussed among others by Fawcett et al. (1998) and
Hand (2007). Case imbalance still poses a challenge to state-of-the-art supervised learning
algorithms, as it hinders the creation of “rich” training datasets with a good coverage and
balance between all problem instances. The small representation of fraudulent cases typically
leads to over-fitting (i.e. the reproduction of patterns that are only specific to the training data
set) and poor generalization capabilities of learned models. In fact, more advanced
computational intelligence learning schemes, which are ideal for digging out complex data
relationships, are paradoxically more prone to over-fitting (Lawrence et al., 1997). In the
context of supervised learning, several techniques have been proposed for dealing with case
imbalance, although there are still many open issues [53]. Common practice suggests adopting a
proper performance metric that accounts for skewed data distributions (see e.g. Behdad et al.,
2012) or employing a specialized sampling scheme that effectively changes the relative
frequency of occurrence and makes fraudulent patterns more apparent in the data set (see
e.g. Sahin and Duman, 2011) [54]. Semi-supervised or novelty detection algorithms, detailed in
section 3.2.4, are less affected by case imbalance, as they only require exemplars of one class,
typically the class of normal transactions (Hodge and Austin, 2004). Some nature-inspired
computational paradigms, such as artificial immune systems, also share this feature (see e.g.
Hunt and Cooke 1996; Behdad, 2012).
[50] See Cortes et al. (2001) and Bolton and Hand (2002).
[51] Indicative of recent trends is the 2011 e-fraud survey launched by CyberSource, which asks
participating merchants about the extent to which they utilise geo-analytics and social network services
as a supplementary tool for order validation. See “2011 Online Fraud Report” available from
http://www.cybersource.com/current_resources.
[52] For instance, Lee et al. (2010) describe a fraud-detection system that relies on autonomous agents for
automatically retrieving and classifying information from distributed web sites.
[53] See also Kotsianis et al. (2006) and Chawla (2010).
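Both remedies mentioned above, a skew-aware performance metric and a sampling scheme, can be illustrated with a short Python sketch. The 10,000-to-1 ratio is the one quoted in the text; the helper names, the choice of recall as the metric and random undersampling as the sampling scheme are assumptions made for illustration only.

```python
import random

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN), with True meaning fraud."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def recall(y_true, y_pred):
    """Share of fraudulent cases that were actually detected."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

def undersample(records, labels, ratio=1.0, seed=0):
    """Keep every fraud case and only `ratio` random legitimate cases per fraud case."""
    rng = random.Random(seed)
    fraud_idx = [i for i, y in enumerate(labels) if y]
    legit_idx = [i for i, y in enumerate(labels) if not y]
    keep = min(len(legit_idx), int(ratio * len(fraud_idx)))
    chosen = fraud_idx + rng.sample(legit_idx, keep)
    rng.shuffle(chosen)
    return [records[i] for i in chosen], [labels[i] for i in chosen]

# One fraud among 10,000 legitimate orders, and a "detector" that never raises an alarm.
y_true = [True] + [False] * 10000
y_pred = [False] * 10001
naive_accuracy = accuracy(y_true, y_pred)   # close to 1 although no fraud is caught
naive_recall = recall(y_true, y_pred)       # 0: the single fraud case is missed

records = list(range(10001))                # placeholder transaction identifiers
sample_x, sample_y = undersample(records, y_true, ratio=10)
```

A detector that never raises an alarm scores almost perfectly on accuracy and zero on recall, which is why accuracy alone is uninformative here. The undersampled set keeps the fraud case and ten legitimate ones, at the price of discarding most of the legitimate data.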
3.5.7 Difficulties in managing late- or false-labelled cases
Fraud detection is not a typical data mining task in which case labels are readily available or
unambiguously defined. For example, the recognition of a fraudulent order may not be
possible until a chargeback claim is sent from the acquiring bank. Some other cases of fraud
may even pass unnoticed depending on how the cybercriminal has decided to exploit the
presented “gap” in the security mechanisms [55]. Most fraudsters typically try to “get the most
out of it”, which soon exposes their intentions. Those cases are perhaps easy to detect and
label. Other perpetrators, however, are more cautious in unfolding their tactics and manage to
escape the card holder’s or merchant’s attention for a considerable time. Thus, it may take many
fraudulent attempts before suspicious transactions are recognized (see also Bolton and Hand,
2002; Gadi et al., 2008; Whitrow et al., 2009). All the cases analyzed above cause delays or
mistakes in the classification of orders. Another case of delayed labelling which is of particular
interest is when a customer initially disputes a transaction (e.g. by failing to remember it or to
recognize the merchant’s code in the card billing statement) but eventually accepts it.
Although this is not fraud in the strict sense, it may still trigger a chargeback process and
temporarily result in false labelling. Delayed or falsely assigned labels can cause serious problems
for rule-based or self-learning fraud detection systems, because they can adversely affect their
performance even if the systems undergo periodic maintenance.
3.5.8 Cost-efficiency concerns
The job of a fraud scoring system should be considered successful to the extent that it
manages to handle incoming requests effectively with little intervention from human experts.
This is because, in a typical e-commerce business, the resources available for order validation
are restricted, especially when response time is also an issue [56]. Most studies putting
forth a new transaction-validation technology typically support its superiority on the grounds
of performance metrics centered around the false negative (type II) and the false positive
(type I) error rate (see e.g. Hand, 2007; Hand et al., 2008). A false negative assessment arises
when a malicious transaction is mistakenly characterized as normal. A false positive error, also
known as a false alarm, happens when a transaction is rejected for being harmful, whereas, in
fact, it is perfectly legitimate. Basing technology comparisons on empirical type-I or -II error
rates effectively presumes equal misclassification costs, a hypothesis that hardly holds in
practice.
[54] Fan et al. (2001) propose a novel methodology for defending against network intrusions by generating
artificial cases of “malicious” connections. The increased proportion of positive examples helps the rule-based classifier define more accurately the notional “boundary” of normal network usage profiles.
[55] See also Hand (2007).
[56] See Hand (2007) for a discussion on the economics of fraud detection systems.
In order to make a realistic breakdown of fraud management costs, we have to take into
account how the fraud defensive mechanism has been designed [57]. Most of the commercial
anti-fraud systems assume a collaborative scheme involving automatic order filtering tools and
trained investigators. When a fraudulent order passes unnoticed, it results in a direct financial
loss for the merchant (e.g. from a chargeback claim or stolen goods/services), which may vary
with the type and the value of the products/services sold. This loss may be augmented by the
cost of conducting a background investigation, if manual reviewers have also been involved in
the case. Falsely rejecting a reliable client results in opportunity losses (on top of fraud-staff
expenses), which may be a serious concern in highly competitive market conditions. Obtaining
accurate estimates of opportunity costs is problematic, as a false positive verdict typically
prompts the customer to leave the e-store, after which his/her traces are lost [58].
In the literature, there have been numerous attempts to adopt cost-based performance
metrics for fraud monitoring systems (Chan and Stolfo, 1998; Prodromidis and Stolfo, 1999;
Stolfo et al., 2000; Gadi et al., 2008; Ozcelik et al., 2010; Duman and Ozcelik, 2011; Sahin et al.,
2013). However, these still lack a holistic perspective of cost-efficiency that takes into account
all aspects of fraud defence operations (fine-tuning and maintenance of transaction screening
tools, manual order verification, management of chargeback claims, etc). It is important to
realize that designing a fraud detector from a cost-efficiency point of view often results in a
completely different system attitude towards fraudulent cases than what the maximization of
the detection rate would dictate. For instance, the system may shift its attention towards
fraudulent transactions with a serious economic footprint, while leaving small-value orders
unattended even if they look suspicious. Although this behavior adversely affects its
performance in terms of detection rate, it may ultimately lead to a system that is more efficient
from an economic point of view (Xu et al., 2007). In principle, there is no anti-fraud technology
that satisfies all possible performance criteria equally well, although some prior analysis is
always required to understand how conflicting different aspirations truly are59. How best to
exploit the potential of data mining techniques in a decision-making situation with cost
considerations is also discussed in Elkan (2001).
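To make the cost argument concrete, the following minimal Python sketch compares the expected
loss of accepting an order with a fixed manual-review cost. The `Order` class, the `REVIEW_COST`
value and the fraud probabilities are illustrative assumptions, not figures from the cited studies.
The sketch also assumes that a manual review always reveals the true status of an order.

```python
from dataclasses import dataclass

# Hypothetical fixed cost of one manual review, in EUR (assumption).
REVIEW_COST = 8.0


@dataclass
class Order:
    order_id: str
    amount: float             # value lost if a fraudulent order is accepted
    fraud_probability: float  # score from some screening model, in [0, 1]


def expected_loss_if_accepted(order: Order) -> float:
    """Expected direct loss when the order is accepted without review."""
    return order.fraud_probability * order.amount


def decide(order: Order, review_cost: float = REVIEW_COST) -> str:
    """Send an order to manual review only when that is the cheaper option."""
    if expected_loss_if_accepted(order) > review_cost:
        return "review"
    return "accept"


if __name__ == "__main__":
    orders = [
        Order("small-but-suspicious", amount=10.0, fraud_probability=0.6),
        Order("large-but-plausible", amount=500.0, fraud_probability=0.1),
    ]
    for order in orders:
        print(order.order_id, decide(order))
```

Under these assumed numbers the small, suspicious order is accepted and the large one is sent to
review, whereas a detector tuned for detection rate alone would flag the first order.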
57
See also the “2011 Online Fraud Report” (http://www.cybersource.com/current_resources) for an
insightful analysis of fraud detection costs.
58
In fact, one of the few cases for which the revenue loss can be accurately estimated is when a
legitimate order is initially mis-flagged by the automated screening tool and subsequently recovered by
a fraud analyst.
59
This is equivalent to exploring the Pareto optimal set of system configurations.
3.5.9 Lack of transparency and interpretability
No doubt, the growing number of research studies on online security management systems is
proof of the ability of these technologies to highlight patterns of fraudulent activity and to
diagnose them early. However, there is still a great deal of scepticism as to how efficiently these
systems can be incorporated in a practical e-business environment. Among the various lines of
criticism, some experts report as a problem the complexity of the resulting classification
structures and the “opaqueness” of the forms in which the obtained knowledge is presented to
the end-user. Some commercially available risk-evaluation tools employ highly parameterized
models, such as support vector machines and neural networks. However, these models often
lack an acceptable level of interpretability, in the sense that it remains difficult for the end-user
to decode and understand the classification result60. This feature has led many authors to
adopt the term “black-box approach” when they refer to these network-learning architectures
(see e.g. Robinson et al., 2011; Ryman-Tubb and Krause, 2011). By contrast, rule bases and
decision trees are considered more intuitive and user-friendly forms of representing
knowledge. Still, the advantages of these architectures can be lost in learning problems
characterized by complex data relationships and high-dimensional search spaces.
Model transparency is very important in practical applications and, when it comes to
e-commerce, it is also imposed by good customer relationship practices (Goodwin, 2002). For
instance, it is always important for the merchant to understand why a particular order has
been blocked or to be able to provide enough justification to a potentially reliable customer
whose transaction has been initially rejected by the system. Despite the efforts that have been
made to boost the application of rule- and tree-inductive learning algorithms in fraud
detection, many of these paradigms still suffer from the other types of weaknesses analyzed
above. Therefore, one expects to gain more from these techniques in a cooperative learning
model (hybrid architecture).
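The appeal of a rule base for the merchant can be illustrated with a small Python sketch in which
every decision is returned together with the reasons that produced it. The rules, field names and
thresholds are hypothetical and serve only to show the idea of a transparent verdict.

```python
from typing import Callable, Dict, List, Tuple

# Each rule is a (human-readable reason, predicate) pair. The rules and
# thresholds below are hypothetical examples, not rules from this report.
Rule = Tuple[str, Callable[[Dict], bool]]

RULES: List[Rule] = [
    ("shipping and billing countries differ",
     lambda o: o["shipping_country"] != o["billing_country"]),
    ("order value above 1000 EUR",
     lambda o: o["amount"] > 1000),
    ("more than 3 orders from this card in the last hour",
     lambda o: o["orders_last_hour"] > 3),
]


def evaluate(order: Dict, max_hits: int = 1) -> Tuple[str, List[str]]:
    """Return the decision together with the reasons that triggered it."""
    reasons = [reason for reason, predicate in RULES if predicate(order)]
    decision = "block" if len(reasons) > max_hits else "accept"
    return decision, reasons


if __name__ == "__main__":
    order = {"shipping_country": "DE", "billing_country": "FR",
             "amount": 1500, "orders_last_hour": 1}
    print(evaluate(order))
```

Because the reasons are plain text, the merchant can pass them on to a customer whose order was
rejected, which a black-box score does not allow directly.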
60
See e.g. Wightman (2003) and Wong et al. (2011) for a discussion.
4 Analysis of data mining for e-sales
This section is dedicated to data mining techniques applied to e-commerce, with a focus on
opportunities and threats for SMEs. It begins with a presentation of the state-of-the-art
technologies and continues with the current trends and practices for e-sales and data mining
techniques. It then describes the commercial products in place, such as web analytics,
data mining suites and tools for price search. The next sub-section focuses on recent research
project results as well as a review of the scientific literature in the domain. The section finishes with
the weaknesses and limitations of current practices compared to SME needs.
4.1 State-of-the-art technologies
As stated in section 2.2, web analytics (Carmona et al., 2012; Hassler, 2012; Kumar, Singh, &
Kaur, 2012; Web Analytics Association, 2008) forms the foundation of data mining (Astudillo,
Bardeen, & Cerpa, 2014; Rajaraman et al., 2013) for e-sales. The three main types of data that
are crucial for e-shop owners are data about
1. where the customer came from before visiting the e-shop and, in case of search
engines as the last step before visiting, which keywords were used for the search
2. the users’ behaviour onsite, e.g. usage statistics and real-time behaviour
3. competitor products, prices and their terms and conditions as well as their marketing
strategies and actions
With tools and methods of web analytics and data mining, information can be derived from
these data that allows the e-shop owners to understand their customers and potential
customers better and to optimize their offering and marketing.
Web analytics tools usually analyse web site referrers in order to provide the first kind of data.
This is used to optimize marketing activities and marketing channels. The second kind of data
provides insights into user behaviour and the potential for optimizing the own web site or
e-shop. The challenges for e-shop owners, and therefore the state of the art which needs to be
taken into account, lie in the following areas (Mobasher, Cooley, & Srivastava, 2000; Yadav,
Feeroz, & Yadav, 2012):
• Gathering the kinds of data from which valuable information can be derived
• Extracting valuable information from those data sets
• Analysing this valuable information in a way that appropriate actions can be taken
• Automating these actions
4.1.1 Data gathering
4.1.1.1 Conversion information
Google Analytics is currently the most widely used tool in the market and leads in terms of
the first category of data, which is needed for the optimization of marketing and channels. The
approach behind this tool is to gather as much information as possible on the path which the
user took before entering the web site or e-shop which uses Google Analytics61. Referrer
information is gathered, analysed and enriched with information that Google has from its
own user behaviour data, e.g. the history of web searches and keywords used that actually led
the user to click on the URL of the e-shop within the Google search results. Alternatives to Google
Analytics include, among others: Clicky62, Mixpanel63, Foxmetrics64, Open Web
Analytics65, Piwik66 and KISSmetrics67. However, these cannot draw on the vast pool of
information with which Google is able to enrich its analytics.
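As a simple illustration of referrer analysis, the Python sketch below splits a referrer URL into
the referring host and the search keywords it carries. It is not a description of how Google
Analytics works internally. The parameter names in `KEYWORD_PARAMS` and the example URL are
assumptions, and many search engines no longer expose keywords in the referrer at all.

```python
from urllib.parse import parse_qs, urlparse

# Query parameters that commonly carry the search phrase (assumption).
KEYWORD_PARAMS = ("q", "query", "p")


def parse_referrer(referrer: str) -> dict:
    """Split a referrer URL into its source host and search keywords."""
    parts = urlparse(referrer)
    params = parse_qs(parts.query)
    keywords = []
    for name in KEYWORD_PARAMS:
        if name in params:
            # parse_qs decodes "+" and "%20" into spaces
            keywords = params[name][0].split()
            break
    return {"source": parts.netloc, "keywords": keywords}


if __name__ == "__main__":
    print(parse_referrer("https://search.example.com/search?q=red+running+shoes"))
```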
4.1.1.2 User behaviour information
Available technologies for the task of data mining and web analysis comprise especially the
following:
• Web Content Mining (Liu & Chen-Chuan-Chang, 2004)
• Web Structure Mining (Markov & Larose, 2007)
• Web Usage Mining (Woon, Ng, & Lim, 2005)
Web analytics tools such as those named in the previous subsection collect rich data sets on
the content and structure of an e-shop and relate them to additionally collected information
on the actual usage of the e-shop, e.g. click paths within the shop, entry and exit pages or the
length of visits, to name just a few. Correlating the gathered data, together with statistics on
these data over a longer period of time and a large number of visitors, allows for pattern
analyses and the application of tools and methods of data mining and, finally, machine
learning (see 4.1.2 Data extraction and analysis). Once correlated with other data, user
behaviour data can, for example, be used as input for recommender systems and be linked
with social web applications (Niwa, Takuo Doi, & Honiden, 2006).
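The usage measures mentioned above (click paths, entry and exit pages, length of visits) can be
derived from a raw click log with a few lines of Python. The log format and the sample records in
`CLICKS` are hypothetical. Real tools collect far richer data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical click log: (session id, ISO timestamp, page).
CLICKS = [
    ("s1", "2014-03-01T10:00:00", "/home"),
    ("s1", "2014-03-01T10:02:30", "/category/shoes"),
    ("s1", "2014-03-01T10:05:00", "/checkout"),
    ("s2", "2014-03-01T11:00:00", "/product/42"),
]


def summarise_sessions(clicks):
    """Return entry page, exit page, click path and visit length per session."""
    sessions = defaultdict(list)
    for session_id, timestamp, page in clicks:
        sessions[session_id].append((datetime.fromisoformat(timestamp), page))

    summary = {}
    for session_id, events in sessions.items():
        events.sort()  # order clicks chronologically
        first_time, entry_page = events[0]
        last_time, exit_page = events[-1]
        summary[session_id] = {
            "entry_page": entry_page,
            "exit_page": exit_page,
            "click_path": [page for _, page in events],
            "visit_seconds": (last_time - first_time).total_seconds(),
        }
    return summary


if __name__ == "__main__":
    for session_id, stats in summarise_sessions(CLICKS).items():
        print(session_id, stats)
```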
4.1.1.3 Competitor information
Web Scraping (Concolato & Schmitz, 2012; Grasso, Furche, & Schallhart, 2013) or simply
buying information from online marketplaces such as Amazon or specialized price search
engines provide the means for accessing data on competitor offerings and especially changes
61
https://www.google.de/analytics/
62
http://clicky.com
63
https://mixpanel.com
64
http://foxmetrics.com
65
http://openwebanalytics.com
66
http://piwik.org
67
http://www.kissmetrics.com
in those offerings. In particular price searches (Kandula & Communication, ACM Special Interest
Group on Data, 2012) are of interest. Price search engines such as Google Shopping Search,
idealo.com, shopping.com or swoodoo.com solve the problem of covering relevant markets
and have the technology implemented to perform an effective scraping of the information
needed. Online marketplaces such as Amazon or Rakuten inherently possess the required
information on products and prices and provide another valuable source of information about
competitors and their products. For the target group of small and medium enterprises, it is
unrealistic to expect e-shop owners to implement their own real-time web or price scraping
to gather this kind of information, as such implementations are simply too complex and
cost-intensive. In addition, this information needs to be up to date: for commodity products,
which potential customers can easily compare, real-time analyses would be necessary with
information not older than five (5) minutes. Even for long tail products, which are much more
difficult for potential customers to compare, information such as competitor prices should be
no older than one day.
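The two freshness requirements can be expressed as a simple check, sketched below in Python. The
five-minute and one-day limits come from the text. The category names `commodity` and `long_tail`
and the function name are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Maximum age of competitor price data per product type. The limits are
# the figures given in the text; the category names are assumptions.
MAX_AGE = {
    "commodity": timedelta(minutes=5),
    "long_tail": timedelta(days=1),
}


def is_fresh(product_type: str, fetched_at: datetime, now: datetime) -> bool:
    """Check whether a scraped price is still recent enough to act on."""
    return now - fetched_at <= MAX_AGE[product_type]


if __name__ == "__main__":
    now = datetime(2014, 3, 1, 12, 0)
    ten_minutes_ago = now - timedelta(minutes=10)
    print(is_fresh("commodity", ten_minutes_ago, now))
    print(is_fresh("long_tail", ten_minutes_ago, now))
```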
The challenge in retrieving competitor information lies in the provision of appropriate
tools which allow the scraping of the required information, e.g. prices, within a certain
timeframe, together with an easy-to-use user interface which allows an appropriate
configuration of the search and scraping tasks.
4.1.2 Data extraction and analysis
The challenge for data mining in e-sales, and especially for the target group of SME e-shop
owners who are addressed by SME E-COMPASS, is to generate added-value information from
the available data sources in an easy-to-use way, in order to optimize sales efficiency within
the own e-shop.
Data mining in e-commerce has produced a rich set of state-of-the-art technologies, methods and
algorithms and is strongly related to fields such as business intelligence and analytics (Lim,
Chen, & Chen, 2013) as well as machine learning (Vidhate & Kulkarni, 2012) and statistics
(Kandel, Paepcke, Hellerstein, & Heer, 2012). Methods relevant for data mining comprise,
among others, “statistical analysis, decision trees, neural networks, rule induction and
refinement, and graphic visualization” (Astudillo et al., 2014).
One challenge is the integration of the above-mentioned data in order to provide additional
analyses of customer behaviour information in comparison to market information, e.g. price
information from competitors.
4.1.3 Automatized reaction to data analysis
The final step in data processing of each kind and thus also for online shopping is to
automatically perform actions or reactions depending on the identification of certain patterns
when analyzing the data. The above mentioned area of machine learning is one option with
focus on the refinement and improvement of data analysis methods and algorithms. Focusing
on the optimization of e-shop offerings and marketing, the approach of rule engines and event
processing (Obweger, Schiefer, Suntinger, Kepplinger, & Rozsnyai, 2011) seems most
promising. The challenge here lies in dealing with the variety of goods and product features as
well as in their vast numbers.
For the challenge of pricing optimization, rule engines can help define pricing strategies for
products or groups of products, so that an automatic reaction of the own e-shop to changes in
competitor pricing can be triggered by defining lower and upper thresholds. Data from price
search engines, marketplaces and own data mining solutions may be used as input for
gathering and analysing relevant price information from competitors. A large number of
products are already available on the market, offering features such as:
• Channel analysis
• Competing Product Analysis
• Customer Analysis
• Forecasting
• Market Analysis
• Price List Management
• Price Optimization Automation
• Price Plan Management
• Price Testing
• Pricing Analytics
• Profitability Analysis
• Scenario Planning
Examples of web-based software of this kind are PriceLenz68 and RepricerExpress69, which are
accompanied by a far larger number of on-premise software solutions. The latter, however,
have a higher barrier to entry for small e-shop owners.
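The threshold-based reaction to competitor price changes described above can be sketched in a few
lines of Python. This is an illustrative rule only and does not describe how the named products
work. The function name `reprice`, the `undercut` amount and the example prices are assumptions.

```python
def reprice(competitor_price: float, floor: float, ceiling: float,
            undercut: float = 0.01) -> float:
    """Follow the competitor price, but never leave the [floor, ceiling] band."""
    if floor > ceiling:
        raise ValueError("floor must not exceed ceiling")
    target = competitor_price - undercut
    # Clamp the target price to the lower and upper thresholds.
    return round(min(max(target, floor), ceiling), 2)


if __name__ == "__main__":
    for competitor in (19.99, 15.00, 30.00):
        print(competitor, "->", reprice(competitor, floor=18.0, ceiling=25.0))
```

The lower threshold protects the margin when a competitor drops prices sharply, and the upper
threshold keeps the shop from following a competitor's price increase beyond its own strategy.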
A challenge which is often neglected in this context is the customization of business rules
(Zhang, He, Wang, Wang, & Li, 2013). On the one hand, customization should allow the setup of
very specific rule sets going beyond the offerings of standard tools; on the other hand, it
should offer an interface that small e-shop owners can use to handle the rules without undue
complexity. This requires either the semantic analysis of rules that the e-shop owner could
ideally formulate in plain language, or an appropriate user interface which provides a
specific set of predefined rules that can be configured by the small e-shop owner. The
challenge is to translate these rules into machine-executable rule sets and to relate them to
the data necessary for decision making. This would also require interactions
with the data analysis mechanisms and algorithms which produce the data required
for decision making in the first place.
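The second option, predefined rules that the shop owner configures, can be sketched as rules
stored as plain data and translated into executable checks. The rule format, field names and
action names below are hypothetical. Interpreting rules written in plain language is a much
harder problem and is not attempted here.

```python
import operator

# Comparison operators that a configuration screen might offer (assumption).
OPERATORS = {"<": operator.lt, "<=": operator.le,
             ">": operator.gt, ">=": operator.ge, "==": operator.eq}

# Rules as plain data, e.g. as saved by a hypothetical configuration form.
RULE_CONFIG = [
    {"field": "competitor_price", "op": "<", "value_field": "own_price",
     "action": "lower_price"},
    {"field": "stock", "op": "<=", "value": 5, "action": "raise_price"},
]


def compile_rule(rule: dict):
    """Turn one configured rule into a function: facts -> action or None."""
    compare = OPERATORS[rule["op"]]

    def run(facts: dict):
        # The right-hand side is either another fact or a constant.
        if "value_field" in rule:
            right = facts[rule["value_field"]]
        else:
            right = rule["value"]
        return rule["action"] if compare(facts[rule["field"]], right) else None

    return run


def triggered_actions(facts: dict, config=RULE_CONFIG) -> list:
    """Evaluate all configured rules and collect the actions that fire."""
    results = (compile_rule(rule)(facts) for rule in config)
    return [action for action in results if action is not None]


if __name__ == "__main__":
    print(triggered_actions({"competitor_price": 9.5, "own_price": 10.0, "stock": 3}))
```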
68
http://www.pricelenz.com
69
http://www.repricerexpress.com
Another challenge in implementing automated rule-based reactions to analysis results is the
interface of such engines with the e-shop software itself as well as with tools supporting the
marketing strategy and execution, e.g. the connectors to Google AdWords, Facebook Ads and
other providers of online advertising solutions.
4.1.4 Information presentation/visualization
Finally, there is a need for reporting to the small e-shop owners:
1. the results of data analyses,
2. the automated actions which have been assigned to certain identified patterns,
and
3. the recommendations for manual activities to be performed.
Especially in the fields of business intelligence and big/smart data applications (Chen, Chiang,
& Storey, 2012), a large variety of dashboard and visualization tools has emerged over recent
years. Product comparison sites such as www.findthebest.com give an overview of the
software and Software-as-a-Service solutions market.
The challenge lies in the integration of the visualization into the previously mentioned
modules of data mining.
4.2 Trends and practices for e-sales
A study of current trends in the field of e-marketing needs to consider experts who are close
to market demand and practical trends. Many experts suggest pursuing different strategies
in e-marketing.
Taking into account the activities that have been described to facilitate the deployment of
e-marketing strategies, the focus is now put on e-marketing trends as a step prior to
explaining the technical tendencies and practices for e-sales70.
EMT | E-Marketing Trend | Putting in context
1 | Brick and mortar | For instance, companies that have helped retailers utilize store networks as an asset, providing premium services such as delivery within 90 minutes of ordering.
2 | Offering increasingly complex online features | For example, when you are looking at 3D augmented virtual fitting rooms you may have gone a bit too far.
3 | Mobile everywhere | “Shoppers want convenience, speed and choice. They want to shop anytime, anywhere, on any device,” says Olivier Ropars, senior director of mobile at eBay Europe.
4 | Market consolidation | “We are using analytics to understand what return on investment we can get from marketing spend", says Dixons ecommerce director Jeremy Fennell.
5 | Gamification within e-commerce | It turns online shopping into an opportunity to play, or integrating elements of social media to encourage peer-to-peer recommendations or advice, is all part of discovery shopping.
6 | Emotions and loyalty | With such elements in place, retailers can seek to provide an online form of window shopping that is more engaging than the physical equivalent.
7 | Same day delivery | Dixons ecommerce director Jeremy Fennell is about to launch same-day delivery, and Fennell says it is this sort of service that will give traditional retailers an edge: “we are offsetting the cost of delivery by creating value in our proposition that customers are prepared to pay for”
8 | Segment for personalize: the own shop view for each individual customer. | Exclusive offers, develop own brand names, differentiation
Table 3: List of e-marketing trends
70
Matthew Valentine based on http://www.retail-week.com/multichannel/analysis-what-is-the-next-phase-of-theecommerce-revolution/5052020.article
In the following section, some trends and practices of e-sales are introduced which have a data
mining technique implemented in the back-end71 (the information in brackets relates to the
four e-marketing sections, i.e. online promotion, online shopping, online service and online
collaboration, listed earlier in this document):
• Pick-up speed or omnichannel customer experience72 (Online shopping & online
collaboration): In this trend, retailers let customers upload video clips in which they
model new products or use a new purchase. Small e-shops could set up a menu or
submenu on their web sites to publish the results of a data mining
classification technique.
• Social-Networking testing (Ayada, W. M., & Elmelegy, 2014) (Online promotion &
Online collaboration): Taking advantage of the impact of social media, there is a
trend to check the “likes” or “favourite” messages that have been set up by the
71
http://www.forbes.com/sites/lauraheller/2011/04/20/the-future-of-online-shopping-10-trends-towatch/
72
http://www.practicalecommerce.com/articles/57800-The-Commerce-EvRolution-Part-2-ChannelTrends
users on Facebook, Twitter and so on. For this purpose, it can be interesting to create a
message such as “I wish this product” and to measure how the number of users who
have clicked on the message evolves over time. Applying this solution requires a
data mining regression technique.
• List of Wishes (Online promotion & online service)73: Related to the previous trend,
another one is to generate a list of products that customers might desire. The web
sites of small e-shops could define a submenu to visualize the groups of desired
products into which customers have been classified. To apply this solution, a data
mining clustering (cluster analysis) method is implemented.
• Cross border e-sales (Online promotion): This accelerating trend becomes more
and more important for small e-shops, especially when taking into account that
the BRIC countries are increasingly developing opportunities for online sales74.
Therefore, entering international markets is relevant to improving or maintaining
the business and its revenues. In order to implement a
localized e-shop for certain countries or regions, the e-shops should feature
different layouts depending on the country or region which is addressed. In this
sense, it would be crucial to monitor the visitors’ actions on the web site through
their digital footprint. In this way, knowledge could be derived on how the web site
should be localized to improve the communication with the foreign target market
and its potential customers. Data mining techniques, such as classification and
clustering (cluster analysis) methods, can support this kind of analysis.
• Suggestive selling (Mussman, Adornato, Barker, Katz, & West, 2014) (Online
promotion & online shopping): A practice in which the holders of e-shops seek to
increase the value of their sales by suggesting related lines: “related to items
you've viewed” or “featured recommendations”. The use of affinity analysis
techniques is recommended for this practice.
• Web banner advertising (Ozen & Engizek, 2014) (Online promotion & online
service): This trend uses the Internet to deliver promotional marketing messages to
consumers based on the web site and related to products of the own company or its
associates. This case is similar to the list of wishes and can be addressed with a
data mining clustering method.
• Rewards (Online promotion & online collaboration): In other words,
merchandising for e-shops75, a traditional practice where consumers
receive coupons, discounts or a percentage of the sale that can be accumulated
and redeemed for later orders in the e-shop. In addition, performance can be
increased by distributing print publications and newsletters which offer deals. In
this case, a regression method may be applied which allows an e-shop owner to
predict the most attractive products for price reductions.
73
http://esellermedia.com/2014/01/20/ecommerce-trends-expect-2014/
74
http://www.practicalecommerce.com/articles/4142-Cross-Border-Ecommerce-Booming
75
http://www.practicalecommerce.com/articles/57800-The-Commerce-EvRolution-Part-2-ChannelTrends
• Search engine optimization or mega markets76 (Online service): Potential
customers expect that the web site appears quickly and that it shows the content or
product offer which was promoted in the search engine. For this case, a clustering
method can be used. The clusters can be analysed automatically by a program or
by using visualization techniques in order to support the customer.
Table 4 matches the e-marketing trends (EMT) with the previously listed trends of e-sales.
E-Marketing Trends (EMT) | Trends & practices
Brick and mortar | No associated trend or practice
Offering of increasingly complex online features | Pick-up speed
Mobile everywhere | Cross border e-sales
Market consolidation | Web banner advertising
Gamification within e-commerce | Pick-up speed, List of Wishes
Emotions and loyalty | Social-Networking testing
Same day delivery | Search engine optimization
Segment for personalize: the own shop view for each individual customer. | List of Wishes, Suggestive selling, Rewards
Table 4: Trends & practices of e-sales versus e-marketing trends
76
http://esellermedia.com/2014/01/20/ecommerce-trends-expect-2014/
At this point, the proposal is to agree on a strategic vision by developing appropriate
e-marketing strategies, to derive a tactic vision by deciding on certain e-marketing trends and,
finally, to implement an operational vision through selected trends & practices (Figure 11).
[Figure: BUSINESS VISION diagram relating E-Marketing Strategies (STRATEGIC VISION), E-Marketing Trends (TACTIC VISION) and Trends & practices (OPERATIONAL VISION), with Data Mining Techniques as a further element]
Figure 11: Business vision and e-marketing
In the next section, the data mining techniques are explained that will be used for e-sales, in
order to implement the above-mentioned trends and practices.
4.3 Data mining techniques for e-sales
The essence of data mining lies in the process called modelling. Modelling is
a procedure in which a model is generated to describe situations whose outcomes are already
known. The generated model can then be applied to situations whose outcomes are unknown
(Çakir, Çalics, & Küçüksille, 2009). A model is generated by applying data
mining algorithms to pre-processed datasets, and the use of complex computational
methods can provide impressive results.
Data mining methods can be grouped into two main categories (Han, Kamber, & Pei, 2006):
1. prediction and
2. knowledge discovery.
While prediction is the stronger goal, knowledge discovery is the weaker approach and
usually precedes prediction. Furthermore, prediction methods can be divided into
classification and regression, while knowledge discovery can be divided into clustering,
mining association rules, and visualization:
• Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data
containing observations (or instances) whose category membership is known. The
individual observations are analysed into a set of quantifiable properties, known as
explanatory variables, features, etc. Decision trees, rule induction, neural networks,
discriminant analysis or case-based reasoning techniques can be used for this task.
• Regression is a process for estimating the relationships among variables; its output
is numerical. In short, the variability of the output variable is
explained by the variability of one or more input variables.
• Cluster Analysis is the task of grouping a set of objects in such a way that objects in the
same group (called a cluster) are more similar (in some sense or another) to each other
than to those in other groups (clusters).
• Association Rules are used to find associations between different types of
information, which can give useful insights.
• Visualisation: a proper graphical representation may give humans a better insight
into the data, and it can be enhanced with statistical parameters and random events.
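As a minimal illustration of regression, the Python sketch below fits a straight line to a short
series by ordinary least squares and extrapolates one step ahead. The data (daily click counts on
a wish message) are invented for the example, and a linear trend is only one of many possible
model choices.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # fails if all x values are identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical daily click counts on an "I wish this product" message.
days = [1, 2, 3, 4, 5]
clicks = [12, 15, 19, 22, 27]

slope, intercept = fit_line(days, clicks)
forecast_day_6 = intercept + slope * 6

if __name__ == "__main__":
    print(f"slope={slope:.2f}, intercept={intercept:.2f}, day 6={forecast_day_6:.1f}")
```

The output variable (clicks) is numerical and is explained by the input variable (day), which is
the distinction from classification drawn in the list above.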
These techniques can be applied to any area of data mining, but the most notable technique in
the field of e-shops is affinity analysis, as used by Amazon Inc.:
• Affinity analysis is a technique that discovers co-occurrence relationships among
activities performed by (or recorded about) specific individuals or groups. In general,
this can be applied to any process where agents can be uniquely identified and
information about their activities can be recorded. In retail, affinity analysis is used to
perform market basket analysis, in which retailers seek to understand the purchase
behaviour of customers. The set of items a customer buys is referred to as an item set,
and relationships between purchases are expressed through rules of the typical form
IF {} THEN {}. This
information can then be used for purposes of cross-selling and up-selling, in addition
to influencing sales promotions, loyalty programs, store design, and discount plans.
The complexities mainly arise in exploiting taxonomies, avoiding combinatorial
explosions (a supermarket may stock 10,000 or more line items), and dealing with the
large amounts of transaction data that may be available. In addition, this technique
only identifies hypotheses, which need to be tested by neural network, regression or
decision tree analyses.
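A minimal market basket sketch in Python is given below. It derives IF {a} THEN {b} rules for
pairs of items together with the two usual measures, support and confidence. The baskets and the
thresholds are invented, and restricting the search to pairs is a simplification that sidesteps
the combinatorial explosion mentioned above. Full association rule miners also handle larger
item sets.

```python
from collections import Counter
from itertools import permutations

# Hypothetical baskets; each set is the item set of one purchase.
BASKETS = [
    {"camera", "memory card", "bag"},
    {"camera", "memory card"},
    {"camera", "tripod"},
    {"memory card", "bag"},
]


def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return rules IF {a} THEN {b} with their support and confidence."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    # Ordered pairs (a, b) so that both rule directions are counted.
    pair_counts = Counter(
        pair for basket in baskets for pair in permutations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # share of baskets with a and b
        confidence = count / item_counts[a]  # share of a-baskets that hold b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    for a, b, support, confidence in pair_rules(BASKETS):
        print(f"IF {{{a}}} THEN {{{b}}}  support={support:.2f} confidence={confidence:.2f}")
```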
Note that the selection of the right algorithms and their proper parameterization are factors
that can determine a project's success.
4.4 Trends & practices vs. data mining techniques for e-sales
The conception and the gradual development of a step-by-step data mining guide by the
pioneers of the data mining market led to the creation of a standard process model to serve
the data mining community (Chapman et al., 2000). The CRISP-DM (Cross-Industry Standard
Process for Data Mining) provides an oversight of the life-cycle of a data mining project
(Plessas-Leonidis, Leopoulos, & Kirytopoulos, 2010).
Moreover, in these types of projects it is crucial to establish which techniques will be used for the
different trends, because in most cases these trends will become functionalities
or requirements of a system. Table 5 presents a mapping of the
previously described trends and practices to the data mining techniques. In
this way, the relation between the trends and practices and the applied data mining
techniques becomes transparent.
# | Trends & practices | Data mining techniques
1 | Pick-up speed | Classification
2 | Social-networking testing | Regression
3 | List of wishes | Clustering
4 | Cross border e-sales | Clustering and/or classification
5 | Suggestive selling | Affinity analysis
6 | Web banner advertising | Clustering
7 | Rewards | Regression
8 | Search engine optimization | Clustering and/or visualization
Table 5: Trends and practices vs. data mining techniques for e-sales
The mapping in Table 5 provides some suggestions for implementing certain trends & practices
as examples. When addressing new trends & practices, the implementation of data mining
techniques needs to be considered.
4.5 Commercial products in place
4.5.1 E-shop software
In order to create solutions for the optimization of small e-shops, the first and most important
information is which type of e-shop is used by the company, which features it provides and
which interfaces it offers in order to feed back information from web analysis and data mining.
Depending on the user requirements analysis and the uptake of the different solutions among
the target group of small and medium e-shop owners, it will be necessary to conduct a deeper
analysis of product features and especially of the interfaces provided, e.g. for the automatic
management of product prices based on analyses made using web analytics tools or
competitor information from price search engines.
In the following, a number of commercial and open source e-shop products are listed in
order to give an overview of the current market. In the course of the user requirements
analysis in WP2, this list will be further qualified in terms of the actual uptake within our target
group.
Cost-free open source e-shop software:
• AuctionSieve
• Bigware Shop
• FWP shop
• Gambio
• Intrexx Professional
• Jigoshop
• JoomShopping
• Magento
• Mondo Shop
• osCommerce
• Oxid esales
• PrestaShop
• Shopware
• StorEdit
• VirtueMart
• WP e-Commerce
Commercial e-shop software:
• 1&1 e-shop
• Comoper.com
• Cosmoshop
• demandware
• dot.Source
• EKMpowershop
• Gambio Onlineshop
• Intershop
• Mincil.de
• Omekoshop
• Oxid
• Revido.de
• Shopcreator
• ShopFactory
• Shopware
• StorEdit
• Strato Webshop
• VersaCommerce
• XT:Commerce
Table 6: Commercial and open source e-shop software
4.5.2 Price Search
One important way to differentiate an e-shop from its competitors is product pricing. The more comparable a product is, and the more it can be considered a commodity, the more important pricing is for the success of an e-shop. As the number of products usually exceeds the range that can be monitored and compared manually, pricing information from price search engines becomes a valuable source of data. In the following, some of the most prominent price search engines in Europe are listed in order to give an overview of the market. It is usually possible to purchase price information, even combined with product identifiers such as the Global Trade Item Number (GTIN) or the International Article Number (EAN).
Price search engine | Languages | Product Categories
Google Product Search (google.com/products) | All | broad
idealo.com | English, French, German | broad
Shopping.com | English, French, German | broad
Twenga.com | English, German, French, Spanish, Italian | broad
Pricerunner.com | English, French, German | broad
Nextag.com | English, German, French, Spanish, Italian | broad
Ciao.com | English, German, French, Italian, Spanish, Dutch, Swedish | broad
Shop.com | English, Spanish | broad
Shopmania.com | English, German, French, Spanish | broad
Megashopbot.com | English | broad
Pricegrabber.com | English | broad
Comparestoreprices.co.uk | English | broad
Skinflint.com | English | broad
Thefind.co.uk | English | broad
Beslist.nl | Dutch | broad
billiger.de | German | broad
Preisroboter.de | German | broad
guenstiger.de | German | broad
Preissuchmaschine.de | German | broad
Shopzilla.de | German | broad
Geizkragen.de | German | broad
Wir-lieben-preise.de | German | broad
Preisvgl.de | German | broad
Schottenland.de | German | broad
Medvergleich.de | German | Pharmacy only
asesorseguros.com | Spanish | Insurances only
Skyscanner.net | Many | Flights
Swoodoo.com | German, Lithuanian | Flights
Flug-vergleich.flug24.de | German | Flights
flights.idealo.com | English | Flights
Skroutz.gr | Greek | Retail
Table 7: Price search engines in Europe
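As an illustration of how purchased price information keyed by GTIN might be used, the following Python sketch first validates the GTIN check digit and then compares an e-shop's own prices with the cheapest competitor offer. The feed layout, the function names `gtin_is_valid` and `price_gaps`, and all sample values are assumptions made for this example; they are not taken from any particular price search engine.

```python
def gtin_is_valid(gtin: str) -> bool:
    """Check the trailing check digit of a GTIN-8, -12, -13 or -14."""
    if not gtin.isdigit() or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in gtin]
    body, check = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, ... starting from the digit next to the check digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


def price_gaps(own_prices, competitor_offers):
    """Return {gtin: own price minus cheapest competitor price}.

    own_prices: dict gtin -> price (hypothetical shop catalogue)
    competitor_offers: iterable of (gtin, shop_name, price) (hypothetical feed)
    """
    cheapest = {}
    for gtin, _shop, price in competitor_offers:
        if gtin_is_valid(gtin):  # ignore rows with a corrupted identifier
            cheapest[gtin] = min(price, cheapest.get(gtin, price))
    return {g: round(p - cheapest[g], 2) for g, p in own_prices.items() if g in cheapest}
```

A positive gap means the own e-shop is more expensive than the cheapest competitor for that product; such gaps could then feed the automatic price management mentioned in Section 4.5.1.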
4.5.3 Web analysis
Before commercial products can apply data mining techniques, the data themselves have to be obtained from the digital footprint that visitors leave on the web site.
There are many professional solutions on the market (e.g. Google Analytics, Piwik and AWStats) that combine the capture of the digital footprint with a web analytics process. Web analytics is a set of scientific tools that draws on statistics and information technologies as well as economics, management, marketing principles and several expert systems from other fields.
A diverse set of tracking tools that capture the interaction of visitors is needed to obtain the digital footprint automatically. Currently, tracking tools work at different levels:
1. server, using server logs,
2. client, by a remote agent (JavaScript or a Java applet) or by modifying the source code of a web browser, and
3. proxy, through an intermediate level that stores data between web browsers and the web server (J. Srivastava, Cooley, Deshpande, & Tan, 2000).
Most published studies rely on tools based on server logs (Cooley, Mobasher, & Srivastava, 1997; T. Srivastava, Desikan, & Kumar, 2005; Zaïane, Xin, & Han, 1998), although more recent studies use a script embedded in the web site (Pitman, Zanker, Fuchs, & Lexhagen, 2010; Plaza, 2011; Shao & Gretzel, 2010).
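To make the server-level option more concrete, the following Python sketch parses web server log lines and counts successful page requests. It assumes that the logs are written in the NCSA Common Log Format; the pattern, the function names and the sample line in the usage note are illustrative assumptions, not part of any of the tools discussed here.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format): host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log_line(line):
    """Return a dict with host, time, method, path, status and size, or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # line does not follow the assumed format
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


def page_views(lines):
    """Count requests answered with status 200, grouped by requested path."""
    counts = {}
    for line in lines:
        record = parse_log_line(line)
        if record and record["status"] == 200:
            counts[record["path"]] = counts.get(record["path"], 0) + 1
    return counts
```

For example, `parse_log_line('192.0.2.1 - - [10/Mar/2014:13:55:36 +0100] "GET /products/42 HTTP/1.1" 200 2326')` yields a record whose `path` is `/products/42`. Requests answered from a browser or proxy cache never reach the server and therefore never appear in such a log.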
Discussion on the different studied web analytics tools
Currently, there are many vendors and solutions on the market which, apart from gathering the digital footprint, also perform web analytics processes, such as Google Analytics, Piwik, AWStats and Adobe Analytics.
In this section, the discussion focuses on the first three, which are the most widely used. Nonetheless, significant methodological and technical differences can be identified among them. Information can be extracted either from server logs or by using a script hosted on the web site.
- AWStats collects the navigation trace left in server logs. It does not allow direct access to the data but exposes them through web reports. As for the attributes captured, the system does not register visits to pages that are hosted in the server cache. It is also not easy to track individual cookies and queries to data hosted on the server, and the time a user spends on a page is inferred by an algorithm.
- Google Analytics and Piwik capture the digital footprint through a script that has to be hosted on the web site and whose implementation requires the involvement of the person responsible for the web site. The script transforms user interaction into recognizable actions in a database, but it cannot capture page reloads or clicks on the back button (J. Srivastava et al., 2000).
- Another significant difference exists between Google Analytics and Piwik. In Google Analytics, data are stored on Google servers, so direct access to the data is not provided. However, access through an API or through a manual export to spreadsheets or a simple text format is supported. The main limitation is that data are aggregated by day and do not cover the navigation attributes. By contrast, Piwik provides full access to the data and supports further analytical possibilities. In this case, the data are disaggregated and related to the time at which the action is performed. In addition, Piwik allows data import from Google Analytics.
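The discussion above notes that time on page has to be inferred and that disaggregated, time-stamped actions allow further analysis. The Python sketch below shows one common heuristic for such an inference; it is not the algorithm used by AWStats or Piwik, which the text does not describe. The 30-minute session timeout, the tuple layout and the function name are assumptions made for this example.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def time_on_page(events):
    """Estimate the seconds spent on each page view.

    events: list of (visitor_id, timestamp, path) tuples, in any order.
    Returns a list of (visitor_id, path, seconds). seconds is None for the
    last page of a session, because no later request shows when the visitor left.
    """
    by_visitor = {}
    for visitor, ts, path in events:
        by_visitor.setdefault(visitor, []).append((ts, path))

    result = []
    for visitor, views in by_visitor.items():
        views.sort()  # chronological order per visitor
        for (ts, path), nxt in zip(views, views[1:] + [None]):
            if nxt is not None and nxt[0] - ts <= SESSION_TIMEOUT:
                result.append((visitor, path, (nxt[0] - ts).total_seconds()))
            else:
                result.append((visitor, path, None))
    return result
```

The heuristic requires one record per action with its own timestamp; data that are only available as daily aggregates cannot be processed in this way.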
4.5.4 Data mining suites
This section describes some commercial data mining products that focus on the techniques mentioned in the previous section. The leading Data Mining Suites (DMS) support a correct application of the required methods.
These commercial solutions focus largely on data mining and include numerous methods. Their application focus is wide and not restricted to a special application field, such as business applications; nonetheless, coupling to business solutions, import and export of models, reporting, and a variety of different platforms are supported (Mikut & Reischl, 2011).
Taking into account the Software Poll (14th) [77] of KDNuggets, one of the most prestigious data mining web sites, the list of products that appear in the literature (Mikut & Reischl, 2011), and solutions that are more oriented towards cloud services and SaaS (Software as a Service) [78], we can establish a set of commercial products that directly or indirectly support e-sales.
Product Name | Product Website | Brief Description
RapidMiner Enterprise Edition | http://rapidminer.com/ | RapidMiner 6 has application wizards for churn reduction, sentiment analysis, predictive maintenance, and direct marketing.
SAS Enterprise Miner | http://www.sas.com/en_us/software/analytics/enterpriseminer.html | Descriptive and predictive modelling produces insights that drive decision making.
IBM SPSS Modeler | http://www01.ibm.com/software/analytics/spss/products/modeler/index.html | IBM SPSS Modeler is a predictive analytics platform that is designed to bring predictive intelligence to decisions made by individuals, groups, systems and the enterprise. No cloud version available.
ADAPA (Zementis) | http://www.zementis.com/adapa.htm | ADAPA is a standards-based, real-time scoring engine available to the data mining community. It is being used by some of the largest companies in the world to analyse people and sensor data to predict customer and machine behaviour in real-time.
STATISTICA Data Miner | http://www.statsoft.com/Products/STATISTICA/Data-Miner | A system of user-friendly tools for the entire data mining process - from querying databases to generating final reports.
TIBCO Spotfire Cloud Enterprise edition | http://spotfire.tibco.com/en/discover-spotfire/spotfireoverview.aspx | Visualize and interact with data. Analytics at the desk or on-the-go. On-premises or in the Cloud.
Skytree server | http://www.skytree.net/products-services/skytree-server/ | It is a platform that gives organizations deep analytic insights, e.g. predict future trends, make recommendations and reveal untapped markets and customers.
Table 8: Data mining suites
[77] The 14th annual KDNuggets Software Poll: http://www.kdnuggets.com/2013/06/kdnuggets-annualsoftware-poll-rapidminer-r-vie-for-first-place.html
[78] Cloud Analytics and SaaS Providers: http://www.kdnuggets.com/companies/cloud-analytics-saas.html
4.6 Open source data mining products in place
The main open source data mining tools are R and Weka. Both can implement the data mining techniques described in the previous section.
R
R is a free software programming language and software environment for statistical computing and graphics. The R language is widely used by statisticians and data miners for developing statistical software and data analysis. R's popularity has increased substantially in recent years, and it facilitates the loading of modules (packages). An interesting module is apcluster.
The apcluster package implements Frey's and Dueck's Affinity Propagation clustering in R. The package further provides leveraged affinity propagation, exemplar-based agglomerative clustering, and various tools for visual analysis of clustering results.
In this case, it would be necessary to deploy an R server to implement a cloud system in order to obtain the results of the data mining techniques applied with the R programming language.
Weka
Weka is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License. Its workbench contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to this functionality.
Weka can be regarded as a library, or DMS, that bundles data mining methods as functions. These functions can be embedded in other software tools using an Application Programming Interface (API) for the interaction between the software tool and the data mining functions.
Caffe
Caffe aims to provide computer vision scientists with a clean, modifiable implementation of state-of-the-art deep learning algorithms.
Shogun
A machine learning toolbox focused on large-scale learning methods, in particular Support Vector Machines (SVM), providing interfaces to Python, Octave, MATLAB, R and the command line.
PredictionIO
An open source machine learning server that runs in the cloud.
BudgetedSVM
An open-source C++ toolbox for scalable non-linear classification.
Table 9: Open source products in place
4.7 Trends & practices vs. data mining techniques for e-sales vs. data
mining suites
Finally, this section establishes full traceability between trends and practices, data mining methods, and the data mining suites that are able to implement the techniques needed to put those trends and practices into effect, with the aim of improving and maximizing e-sales for SMEs.
# | Trends & practices | Data mining techniques | Data mining suites
1 | Pick-up speed | Classification | RapidMiner v6 server edition, ADAPA, SAS Enterprise Miner, Skytree server
2 | Social-Networking testing | Regression | RapidMiner v6 server edition, ADAPA, SAS Enterprise Miner, Skytree server
3 | List of Wishes | Clustering | RapidMiner v6 server edition, ADAPA, SAS Enterprise Miner, Skytree server
4 | Cross border e-sales | Clustering and/or Classification | RapidMiner v6 server edition, ADAPA, SAS Enterprise Miner, Skytree server
5 | Suggestive selling | Affinity Analysis | RapidMiner v6 server edition, ADAPA, SAS Enterprise Miner, Skytree server
6 | Web banner advertising | Clustering | RapidMiner v6 server edition, ADAPA, SAS Enterprise Miner, Skytree server
7 | Rewards | Regression | RapidMiner v6 server edition, ADAPA, SAS Enterprise Miner, Skytree server
8 | Search engine optimization | Clustering and/or Visualization | RapidMiner v6 server edition, ADAPA, SAS Enterprise Miner, Skytree server
Table 10: Trends & practices vs. data mining techniques vs. data mining suites
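The traceability in Table 10 can also be held as a small lookup structure, which makes it easy to ask which practices depend on a given technique. The following Python sketch transcribes the table; the variable and function names are assumptions made for this example, and "and/or" entries are stored as two separate techniques.

```python
# The same four suites are listed for every practice in Table 10.
SUITES = (
    "RapidMiner v6 server edition",
    "ADAPA",
    "SAS Enterprise Miner",
    "Skytree server",
)

# Practice -> data mining techniques, transcribed from Table 10.
TECHNIQUES = {
    "Pick-up speed": ("Classification",),
    "Social-Networking testing": ("Regression",),
    "List of Wishes": ("Clustering",),
    "Cross border e-sales": ("Clustering", "Classification"),
    "Suggestive selling": ("Affinity Analysis",),
    "Web banner advertising": ("Clustering",),
    "Rewards": ("Regression",),
    "Search engine optimization": ("Clustering", "Visualization"),
}


def practices_using(technique):
    """List the practices that Table 10 maps to the given technique."""
    return [p for p, techs in TECHNIQUES.items() if technique in techs]


def trace(practice):
    """Return (techniques, suites) for one practice."""
    return TECHNIQUES[practice], SUITES
```

For instance, `practices_using("Regression")` returns the two practices "Social-Networking testing" and "Rewards".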
4.8 Research project results and scientific literature
In this section, a series of research projects related to SME E-COMPASS are described together with their respective results. In addition, on the same subject of data mining techniques for e-sales, the most representative papers in the scientific literature are reviewed.
4.8.1 Research Projects
First, research projects carried out within the CORDIS Framework Programmes of the European Commission are listed chronologically. In addition, research projects from other programmes are also listed below.
1. ShopAware - New Methods of E-Commerce: Virtual Awareness and Total Customer Care. From 2000-01-01 to 2001-12-31. With reference: IST-1999-12361, total cost: EUR 1 332 737, and from FP5-IST program. This project developed new methods of electronic commerce that bring personal support to Internet-based electronic commerce. The combination of the VP service CoBrow with e-commerce software and interfaces to customer relations management software was used to support the personal staff of commercial web sites. ShopAware was based on a modified CoBrow vicinity server interacting with database-driven virtual shopping systems. VP was able to build personal relationships between entrepreneur and consumer in the otherwise lifeless cyberstores. The main premises of this project were that next-generation web-based e-commerce systems must integrate customer-focused sales support tools, and that online, live communication with the customer during the sale and support phases, together with individually tailored offerings based on knowledge about the customer, builds a stable relationship.
2. Intelligent Online Configuration of Products by Customers of Electronic Shop Systems
(INTELLECT). From 2000-01-01 to 2002-03-31. With reference: IST-1999-10375, total cost:
EUR 1 421 537, and from FP5-IST program. The INTELLECT project aimed to contribute to a
new type of trade in the business sector of trading and shopping in Europe. Therefore the
project has developed an electronic shop system including an online configuration module
for products which are represented by 3D / virtual reality techniques and advanced user
assistance and advice to improve the business opportunities for European service
providers and consultants as well as for manufacturers, wholesalers, sellers, and their
customers. INTELLECT objectives were to enable the suitable representation of products
including all practicable variants in electronic commerce systems to achieve the most
realistic possible visualization.
3. Virtual Sales Assistant for the complete Customer Service Process in Digital Markets
(ADVICE). From 2000-01-01 to 2002-04-30. With reference IST-1999-11305, total cost: EUR
2 886 333, and from FP5-IST program. The overall objective of the ADVICE project was the
development and real-world testing of an intelligent virtual sales and service system
beyond simple product listing or intelligent product search. ADVICE offered intelligent
product advice and guides through the selection of products, instructed the application of
products and provided step-by-step solutions for technical problems. The system was
designed for the consulting about craftsman tools, but the architecture was designed to be
as flexible as possible to enable the adaptation of the system to other products or
languages. Objectives: Existing "smart" systems limit consulting to intelligent product search by case-based reasoning or offer "reactive" dialogues based on reactions to keywords in the user dialogue. ADVICE developed a knowledge-based multi-agent system, which contained detailed knowledge of the products. Customers could communicate with the system using text input. The system advisor explored the needs of the customer and explained the products or, as an after-sales service, provided product application examples.
4. Local Intelligent Agent as Informed Sales Expert (LIAISE). From 2000-01-01 to 2002-09-30. With reference: IST-1999-10390, total cost: EUR 2 999 940, and from FP5-IST program. The LIAISE project aimed at producing a commercial tool to aid in the configuration and quotation of complex, highly configurable multivendor systems along the whole systems value chain. The LIAISE system suggested a new approach to e-commerce, providing a new solution for implementing B2B systems and, consequently, new services. The Decision module of LIAISE uses Multi-Attribute Utility Theory and further artificial intelligence techniques to select the best products for the user. The Infrastructure Layer was in charge of managing the workflow created dynamically by the need to deal with a user's request for quotation, and of providing the timing and correct activation sequence of the services inside an individual node in the LIAISE scalable architecture.
5. Benchmarking of E-Business Solutions for Western and Eastern Europe SMEs (BENE-BUS). From 2000-12-01 to 2002-11-30. With reference: IST-1999-29024, total cost: EUR 1 102 404, and from FP5-IST program. For its target users (mainly European SMEs), BENE-BUS designed a set of services that were accessible through the web and provided by the trans-national consortium. These can be summarized as follows: essential direct support services that enable SMEs to implement innovative business processes based on e-business solutions; and a service supplier database of existing organizational and technical resources for supporting the e-business processes of SMEs and for constructing alliance networks that enable SMEs to operationally implement the e-platform solutions.
6. Practical Knowledge Management to support Front-line Decision making in SMEs. From 2001-01-01 to 2002-06-30. With reference: IST-1999-56403, total cost: EUR 869 212, and from FP5-IST program. The aim of the project was to develop an intelligent, web-based system (implemented as a portal solution) to support front-line decision-making in SMEs. The system was intended to help front-line workers make better and more profitable business decisions, avoid wasting time and money resolving problems, and increase customer satisfaction. Further aims were to improve products and services by making use of the knowledge obtained from the front lines, and to support SMEs in providing Internet-based self-service capabilities to their customers.
7. Transforming Utilities into Customer-Centric Multi-Utilities. From 2001-01-01 to 2003-02-28. With reference: IST-2000-25416, total cost: EUR 2 881 002, and from FP5-IST program. This project aimed at developing solutions that enable utilities to provide the European consumer with better services, in a flexible manner. The project addressed a specific business requirement: how can utilities, driven by deregulation initiatives, be transformed so that they offer a more competitive set of services? The "e-utilities" project utilized technologies such as knowledge modelling and business change support, data warehousing and mining, and e-commerce, so as to deliver the following components: Customer Profiling, Virtual Utility Market, Virtual Utility Shop, and Standards for Change.
8. BusIness ONtologies for dynamic web environments. From 2002-01-01 to 2003-12-31.
With reference: IST-2001-33506, total cost: EUR 1 980 000, and from FP5-IST program.
BIZON was an innovative approach to dynamic value constellation modelling and
governance for e-business. The main goal was to design and build a knowledge founded
framework (consisting of ontologies, knowledge bases, semantic web, web data mining,
machine learning) able to support and optimize a business environment characterized by
product personalization, demand anticipation and process self-organization. This e-value
ontology described production and exchange processes, in the broadest sense. It
encompassed marketing aspects (value perception had a lot to do with web data mining),
production aspects (value generation implied planning and scheduling of production
processes) and other aspects, e.g. legal ones. Important concepts treated in the e-value
ontology were time, planning and scheduling, products variation and combination,
products and services personalization of buyers and anticipation of market trends.
9. Personalizing e-commerce using web mining. From 2000-09-01 to 2004-08-31. With
reference: HPMT-CT-2000-00049, total cost: EUR 158 400, and from FP5-HUMAN
POTENTIAL. The aim of PERSONET was to provide training to PhD students from across
Europe on "Personalizing E-Commerce using web Mining". Places were available for three
or four fellows per year for four years. Individual fellowships were for three to twelve
months in duration. A selected fellow had the opportunity to work in a vibrant culture with
opportunities to participate in theoretical training courses and gain practical skills through
working with a leading research team on EU funded projects.
10. Analysis of Marketing Information for Small- and Medium-sized Enterprises. From 2004-09-16 to 2006-09-15. With reference: 5875, total cost: EUR 1 463 202, and from FP6-SME. AMI-SME aimed to provide a solution for the specific information requirements of SMEs, which face the challenge of obtaining sound information as a basis for future-proof decisions in the field of marketing and sales.
11. E-Sales Research Project: Active Selling Through Electronic Channels and Social Media. 05/2010 – 05/2012. http://www.e-sales.fi/esales/. The main objective was to increase selling-centered know-how and competitiveness among Finnish companies by conducting high-quality international research and seeking branch-specific best practices in the field of electronic selling and selling-intensive social media. The E-sales project was funded by the Finnish Funding Agency for Technology and Innovation (Tekes) and partner firms. Tekes is the main public funding organisation for research and development in Finland.
All these projects were developed with the general aim of creating and enhancing e-sales/e-commerce platforms. Nevertheless, a series of differences can be found with regard to the functionalities of SME E-COMPASS. Applications in past projects focused on a few specific SME functionalities or on previously existing applications. In SME E-COMPASS, the core functionalities of the analytic applications are designed from a generalist perspective, trying to cover the initial requirements of a large number of SMEs in European regions.
4.8.2 Scientific Literature
In addition to the results of the research projects above, a series of related works can be found in the scientific literature, where several interesting books, journals and conference papers have appeared from 2000 to date. This literature can be classified into a few topics: classical mining of e-commerce data and web mining in general, market basket analysis, works focusing on real-world applications (telecommunications, tourism, etc.), and works based on big data analysis.
In terms of classical techniques, a number of works from the last decade apply machine learning algorithms to mine e-commerce data and web-based scenarios, with the aim of extracting implicit information about clients' activities and market fluctuations. As a starting point for this review, it is worth mentioning the book Data Mining Techniques: For Marketing, Sales, and Customer Support, whose first edition (1997) compiled for the first time a great number of data mining techniques for marketing and sales. Although e-commerce web-based systems were not considered in this early edition, the latest edition (Berry and Linoff, 2011) contains several use cases of web mining techniques for e-sales and e-shopping information. After the first edition of this book, a number of scientific works appeared that directly tackled the analysis of e-commerce information from a data mining point of view (Kohavi, 2001, Ansari et al., 2001, Linoff and Berry, 2001, Kohavi et al., 2004, Lee and Liu, 2004, Ting and Wu, 2009).
More recently, advanced studies of customer opinion and sentiment analysis (Sadegh et al., 2012, Rahi and Thakur, 2012, Dziczkowski et al., 2013) have become very popular, since they provide inferred information about new implicit tendencies of users. In addition, surveys (Pitman et al., 2010) and taxonomies (Zhao et al., 2013) of web data mining applications can be found that gather and order the existing literature on this matter. Special mention should be made of works providing services in real-world industry initiatives (Ting and Wu, 2009). In this regard, several successful examples can be found in the fields of telecommunications (Oseman et al., 2010) and tourism (Pitman et al., 2010, Xiuhua, 2012, Ge et al., 2014).
More specifically, market basket analysis (Dhanabhakyam and Punithavalli, 2011) is one of the most interesting subjects in e-commerce/e-sales, since it allows customer buying patterns to be examined by identifying association rules among the items that customers place in their shopping baskets. Identifying such associations can help retailers expand their marketing strategies by gaining insight into which items are frequently purchased together. It helps to examine customer purchasing behaviour and assists in increasing sales and conserving inventory by focusing on point-of-sale transaction data (Dhanabhakyam and Punithavalli, 2011). In this sense, current works focus on managing large amounts of data (big data) (Woo, 2012) to find these kinds of association rules and to assist experts on extensive e-commerce platforms (eBay, Amazon, etc.).
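The association rules mentioned above are usually characterized by three measures: support (the share of baskets containing both items), confidence (the share of baskets with the antecedent that also contain the consequent) and lift (confidence divided by the overall share of the consequent). The Python sketch below computes these measures for pairs of items only. It is a minimal illustration under assumed thresholds and with an assumed function name, not the method of any of the cited works, and it does not scale to the big data settings discussed there.

```python
from itertools import combinations


def association_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return pairwise rules as (antecedent, consequent, support, confidence, lift)."""
    n = len(baskets)
    item_count = {}
    pair_count = {}
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_count[x]
            if confidence >= min_confidence:
                lift = confidence / (item_count[y] / n)
                rules.append((x, y, support, confidence, lift))
    return rules
```

A lift above 1 indicates that the two items are bought together more often than would be expected if they were purchased independently, which is the kind of insight a retailer can use for suggestive selling.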
Finally, new trends in web mining analysis are mainly focused on the use of big data and cloud computing services (Buchholtz et al., 2013, Woo, 2012, Kawabe, 2013, Rao et al., 2013, Russom, 2013). These make it possible to manage the large data repositories commonly generated by current web e-commerce services and associated social networks. In this sense, the analysis of customers' behaviours and affinities across multiple linked sites for e-shopping, social networks, e-marketing, security and online payment tools in digital ecosystems constitutes one of the most promising research areas at present (Kawabe, 2013, Damiani, 2007, Bishop, 2007).
4.9 Weaknesses and limitations of current practices compared to SME
needs
In this section, the identified trends and practices are examined with regard to the needs of very small e-shop owners. Entailment (2013) mentions an increase in competition due to market consolidation. Therefore, e-shops need to examine their visibility among their target groups and consider developing a new positioning if required. Internationalising the e-shop in order to address new markets may also help to expand the business.
Additionally, improved processes and increased efficiency may reduce costs and attract new visitors to the e-shop.
[Figure 12: bar chart on a 0% to 100% scale comparing small e-shops with medium and large e-shops for the following marketing activities: search engine optimization (SEO), search engine advertising (SEA), newsletter (regular e-mailing), social media, press work, e-mail marketing (irregular marketing campaigns), ads in newspapers and magazines, banner advertising (without online video advertising), affiliate/sales partner programmes, online video advertising, price comparison sites, TV and radio advertising, other print advertising (flyers, etc.), other advertising]
Figure 12: Which marketing activities do you conduct in order to attract visitors to your e-shop (Bauer et al., 2011)
A closer look at Figure 12 makes obvious the differences between small e-shops and medium and large e-shops in the intensity of their marketing activities. Only search engine optimization, newsletters and participation in price and product comparison web sites are used with similar intensity by small and medium/large e-shops. All other activities are used significantly less by small e-shop owners. This lower level of activity negatively influences the visibility of small e-shops and thus their revenues.
Two main strategies for small e-shop owners to compete against their competitors are:
1. Increasing visibility and bringing more visitors to the e-shop:
In this case the marketing activities need to be intensified, which requires more resources in terms of manpower and budget; these are usually scarce at small e-shops.
Some of the latest ideas for improving visibility are mentioned where experts forecast the trends of 2014. Peters (2013) and Rönisch (2013) emphasize the importance of electronic marketplaces, which allow e-shop owners to participate in web-based ecosystems that have established a great number of recurring visitors and customers (Dukino & Kett, 2014). Small e-shop owners in particular may benefit from participating in such web-based ecosystems. According to Rönisch (2013), eSeller - Building your digital business (2014) and Hesse (2013), multi- or even omni-channel presences are increasingly being developed in order to address customers over the channels they are used to. Here, customer-centricity is the key objective that e-shops try to achieve in order to reach the customer and motivate him to buy. eSeller - Building your digital business (2014) created the image of “the customer who is now a multi-platform hoping beast and you have to be everywhere to catch their coins as they leap from plinth to plinth. Customers these days take their time over shopping and want to do it when and where they chose across multiple devices.” Therefore, Entailment (2013), rakuten (2013), Peters (2013), Rönisch (2013), Charlton (2013) and Elizabeth (2014) address the topic of mobile services. The term mobile services has various facets, such as location-based services, mobile payment methods, responsive design, device-first thinking, and hyper-targeting, i.e. sending visitors of a shop messages on their mobile devices with useful information about the shop, its products and services.
2. Harvesting the visitors who enter one's own e-shop:
In order to harvest the visitors, a better understanding of the visitors’ motivation and
expectations when entering the e-shop needs to be developed, so that high-quality,
personalized content can be presented to attract the visitors’ interest.
For example, the trend towards more emotions and loyalty may feature a clear and
personal profile of the e-shop which addresses certain target-groups, improved
services which are valued by the target groups, interesting and well-presented content
which attracts the target-groups, and the possibility to share and interact with other
visitors in the e-shop (Entailment, 2013). The Ferrero-principle pushes the
development of own brands, the development of exclusive offers, and heads towards
a differentiation strategy (Entailment, 2013). Personalization enables the e-shops to
present content depending on the preferences of the visitors. The visitors are
influenced in their buying decision by many different factors and channels. How
the buying decision is influenced very often remains unknown to the e-shop
owners. However, the information about visitor journeys that is already available
within the e-shops puts pressure on e-shop owners to analyse it before the
competition does (Elizabeth, 2014; Entailment, 2013; rakuten, 2013). Big data and
its usage are addressed by most of the experts when discussing trends in e-commerce.
The creation of user profiles is currently getting more and more important. On the
basis of the user profiles the visitors of an e-shop can be personally addressed (Peters,
2013).
Data are taking the lead, and small e-shop owners need to understand how to make use of
big data (Charlton, 2013; Elizabeth, 2014; Hesse, 2013; Rönisch, 2013). To do so, small e-shops need tools that suit them. According to Bauer et al. (2011), Google Analytics is the most
widespread analytics tool, used by two thirds of the e-shops. The
top 3 requirements for such an analytics tool are high usability (59 percent), fast analysis (51
percent) and compliance with data protection laws (48 percent) (Bauer et al., 2011). The
available web analytics tools only partially meet the requirements of small e-shops. Figure 13
shows the reasons why e-shops don’t use web analytics. The first two reasons, which concern
40 to 50 percent of the e-shop owners, illustrate the complexity of the topic. About one third
of the e-shop owners claim that web analytics tools are too expensive.
Reasons for not applying web analytics (share of e-shops):
• too little time for gathering and analysing the data: 51%
• missing know-how: 40%
• too expensive: 30%
• data protection reasons: 12%
• have not heard about web analytics: 8%
• no benefit: 5%
• other reasons: 1%
Figure 13: Why don't you use a web analytics tool? (Bauer et al., 2011)
The complexity also becomes obvious when examining how often the e-shops analyse their
web metrics. More than 50 percent of the small e-shops claim that they conduct web analytics
monthly or very irregularly (even less often) (Bauer et al., 2011).
In conclusion, in order to attract more visitors to one's own e-shop and to offer them
personalized content that matches their needs, a better understanding of the visitors
of an e-shop is increasingly becoming a key factor for success. However,
understanding the visitors requires the ability to analyse their behavior in the e-shop.
Small e-shop owners need to overcome the complexity of web analytics and the hurdle of
developing the know-how needed to use it. In order to understand the visitors’
behavior and take appropriate actions, the SME E-COMPASS project should provide
support and an easy-to-use tool that facilitates the usage of web metrics, enriches existing web
metrics with additional data sources in order to derive appropriate actions, and visualizes
the data and the actions appropriately, moving towards a decision support system.
5 From Knowledge Harvesting to Designing E-COMPASS Methodological Framework
The main purpose of this section is to specify the objectives that have to be addressed by the
project in the field of e-commerce applications for secure transactions and increase of sales for
SMEs. These technological and scientific objectives build a foundation for the subsequent work
packages, providing the basis for WP2, WP3, WP4 and the following activities. These
foundation principles and the specific objectives will also guide all evaluation activities (WP6).
5.1 Technologies Pre-selection
This section briefly describes the technologies and techniques that have been pre-selected and will
be implemented in WP3 and WP4.
5.1.1 Anti-fraud System
The nearly two decades of development for fraud monitoring systems have witnessed a
flourishing of different types of technologies with often promising results. In the early years,
fraud detection was accomplished with standard classification, clustering, data mining and
outlier detection models. Researchers soon realized the peculiarities of the problem domain
and introduced more advanced solutions, such as nature-inspired intelligent algorithms or
hybrid systems. The latter stream of research advocates the combination of multiple
technologies as a promising strategy for obtaining a desirable level of flexibility. First results
from the adoption of this practice to real-life e-commerce environments seem encouraging
(see section 3.2.5). Still, how best to fine-tune a hybrid system presents a challenge to the
designer, as it very much depends on performance aspirations (cost-efficiency vs. prediction
accuracy) and the conditions of the operating environment79. This is one of the issues to be
considered by the partnership of WP3.
Our proposal for an automatic fraud detector follows the hybrid-architecture principle, in the
spirit discussed above, and is schematically depicted in Figure 14. A more detailed description
of the functionalities of each module is given in section Error! Reference source not found..
The system has two major components: the inference engine and the knowledge database
(DB). The knowledge database consists of various types of fraud detection systems or
techniques (expert rule engine, supervised learning algorithms or anomaly detectors), whereas
the inference engine is the coordinator of the classification process.
79 See also section 3.5.8.
Each newly arrived order flows through the inference engine and receives a risk score (RS)
depending on its characteristics. This score reflects the confidence by which the order can be
regarded as fraudulent and ranges between 0% (the transaction is genuinely normal) and
100% (the transaction is extremely risky). The use of a smooth grading scale generally
facilitates the handling of borderline cases and naturally resembles the scoring process
followed by human experts.
Once scoring is completed, the transaction is routed according to the three-event fraud-detection protocol illustrated in Figure 15:
1) If the risk score is below a lower cut-off point (COPL), the order is accepted and
executed automatically.
2) If the risk score is above an upper cut-off point (COPU), the order is rejected without
further notice.
3) If the risk score lies between COPL and COPU, the order is sent to fraud analysts for
further investigation.
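The three-event protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the names `Decision` and `route_order` are hypothetical, and the handling of scores that fall exactly on a cut-off point (sent to review here) is an assumption, since the text does not specify it.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"   # order accepted and executed automatically
    REJECT = "reject"   # order rejected without further notice
    REVIEW = "review"   # order sent to fraud analysts


def route_order(risk_score: float, cop_l: float, cop_u: float) -> Decision:
    """Route an order by its risk score; all values are percentages in 0..100."""
    if not 0.0 <= cop_l <= cop_u <= 100.0:
        raise ValueError("cut-off points must satisfy 0 <= COPL <= COPU <= 100")
    if risk_score < cop_l:
        return Decision.ACCEPT
    if risk_score > cop_u:
        return Decision.REJECT
    # Assumption: scores equal to COPL or COPU are treated as borderline cases.
    return Decision.REVIEW


# Illustrative cut-off values only (COPL = 5%, COPU = 95%).
decisions = [route_order(score, 5.0, 95.0) for score in (2.0, 40.0, 99.0)]
```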
A good design practice is to choose a value close to 0% for COPL and a value close to 100% for
COPU. This restricts the possibility that a fraudulent transaction is mistaken for a
normal one (false negative assessment) and that a legitimate order is falsely denied (false positive error),
respectively. However, as the spread between COPL and COPU increases, more
and more orders fall into the “grey” zone. We thus create more need for human
intervention and effectively reduce the benefits of automating the fraud detection process.
Instead of setting the decision boundaries arbitrarily, we adopt a data-driven approach that
takes into account several parameters of the business environment in which the system is
meant to operate. The general idea is to choose the values of COPL and COPU that result in an
optimal system behavior with respect to one or more performance metrics set by its manager
(fraud detection rate, the ratio of false negatives to false positives, misclassification cost, etc).
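One way to picture the data-driven choice of cut-off points is an exhaustive search that minimises the misclassification cost on labelled historical orders. The sketch below assumes misclassification cost as the performance metric; the function names, the cost figures and the candidate grid are hypothetical and only serve the illustration.

```python
from itertools import product


def total_cost(scores, labels, cop_l, cop_u, cost_fn, cost_fp, cost_review):
    """Cost of applying the three-event protocol to labelled historical orders.

    labels[i] is True when order i turned out to be fraudulent.
    """
    cost = 0.0
    for score, is_fraud in zip(scores, labels):
        if score < cop_l:        # accepted automatically
            cost += cost_fn if is_fraud else 0.0
        elif score > cop_u:      # rejected automatically
            cost += 0.0 if is_fraud else cost_fp
        else:                    # grey zone: manual review
            cost += cost_review
    return cost


def choose_cutoffs(scores, labels, grid, cost_fn, cost_fp, cost_review):
    """Return the (COPL, COPU) pair from the grid with the lowest total cost."""
    candidates = [(lo, hi) for lo, hi in product(grid, grid) if lo <= hi]
    return min(
        candidates,
        key=lambda c: total_cost(scores, labels, c[0], c[1],
                                 cost_fn, cost_fp, cost_review),
    )


# Hypothetical history: risk scores in percent and the true outcome of each order.
history_scores = [3, 8, 15, 40, 55, 70, 90, 97]
history_labels = [False, False, False, False, True, False, True, True]
best = choose_cutoffs(history_scores, history_labels, range(0, 101, 10),
                      cost_fn=100.0, cost_fp=20.0, cost_review=2.0)
```

Raising `cost_review` relative to the two error costs shrinks the grey zone, which mirrors the trade-off between automation and human intervention described above.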
Figure 14: A schematic description of the anti-fraud system functionalities and architecture.
[Diagram elements: incoming order; inference engine; knowledge DB (expert system, supervised learning techniques, anomaly detector); TAT; transactions DB; experts; risk scoring (RS); cut-off points; final classification]
Figure 15: The order evaluation process.
[Diagram elements: inference engine; risk score (RS); RS < COPL: GO; RS > COPU: STOP; COPL < RS < COPU: fraud analysts; final classification (GO or STOP)]
For the Anti-fraud system-service the following technologies are pre-defined:
1) Expert systems. In the context of SME E-COMPASS, an expert system would consist of
multiple rules-of-thumb for assessing the riskiness of each transaction. Knowledge can be
encoded in the system in various forms: as a set of IF-THEN rules activated in parallel (see
e.g. section 3.2.2) or as a hierarchical (tree-like) structure, in which transaction parameters
are analysed sequentially according to their importance.
2) Supervised learning techniques. A variety of supervised learning models, reviewed in section
3.2.3, can be used to extract patterns of fraudulent activity from the transaction database
(DB). However, the design plan of any supervised classifier should also provide clear
guidelines with respect to the following implementation issues:
a) How to create training/validation data sets from a possibly big pool of transactions
b) How to reduce the dimensionality of the feature space and
c) How to efficiently cope with the case imbalance problem, which presents an obstacle
to the application of knowledge extraction techniques (see section 3.5.6).
3) Anomaly detectors. Anomaly detectors are well suited for online fraud monitoring, as they
do not typically rely on experts to provide signatures for all possible types of fraud. Among
the great range of candidate technologies, we particularly favour the application of hybrid
(semi-supervised) novelty detectors, combining statistical techniques with computational
intelligence models (see also section 3.2.4). These often present an effective means of
detecting outliers in the complex, high-dimensional data spaces arising from the
analysis of typical transaction databases.
4) Inference engines. The purpose of the inference engine is to coordinate the risk
assessment process and provide an aggregate suspiciousness score through which each
transaction can be classified in predefined categories (normal, malicious, under review).
An inference engine performs a variety of operations, such as:
a) Analysing the transaction parameters and converting the attributes array to a format
understandable by the base classifiers.
b) Isolating the most prominent attributes of each transaction to be considered by each
classification model (feature selection).
c) Selecting the set of scoring rules applicable to each type of good/service or each
market segment (rules customization).
d) Consolidating the outputs generated by each independent module of the knowledge
database (expert system, supervised classifiers, anomaly detectors).
e) Resolving possibly conflicting verdicts (e.g. by taking into account the credibility of
each base classifier).
5) Transaction analytics (TA). TA technologies typically provide the fraud analyst with
technical or geographical information about each transaction and thus supplement in
many ways traditional background investigations on customer profiles. Non-conventional
aspects of the transaction that can convey valuable information about its validity are
device configurations, web browser settings, spatial displacement between
IP/contact/shipping address, issuing bank details, site navigation patterns, number of
unsuccessful payment attempts with the same card, etc.
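To make the interplay between the knowledge database and the inference engine more concrete, the following sketch combines a toy expert system (item 1) with a credibility-weighted consolidation of module outputs (items 4d and 4e). It is an illustration under stated assumptions: the rule thresholds, attribute names, weights and the weighted-average scheme are hypothetical and are not prescribed by the framework.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Order = Dict[str, Any]


@dataclass
class BaseClassifier:
    name: str
    score: Callable[[Order], float]   # returns a risk score in percent (0..100)
    credibility: float                # non-negative weight given to this module


def expert_rules(order: Order) -> float:
    """Toy IF-THEN rules evaluated in parallel; every rule that fires adds risk."""
    risk = 0.0
    if order.get("failed_payment_attempts", 0) >= 3:   # hypothetical threshold
        risk += 50.0
    if order.get("ip_country") != order.get("shipping_country"):
        risk += 30.0
    return min(risk, 100.0)


def aggregate_risk(order: Order, classifiers: List[BaseClassifier]) -> float:
    """Credibility-weighted average of the base classifiers' risk scores."""
    total_weight = sum(c.credibility for c in classifiers)
    if total_weight <= 0:
        raise ValueError("at least one classifier needs a positive credibility")
    weighted = sum(c.credibility * c.score(order) for c in classifiers)
    return weighted / total_weight


# The anomaly detector is a stub returning a fixed score, for illustration only.
modules = [
    BaseClassifier("expert_system", expert_rules, credibility=2.0),
    BaseClassifier("anomaly_detector_stub", lambda order: 20.0, credibility=1.0),
]
sample_order = {"failed_payment_attempts": 3, "ip_country": "DE", "shipping_country": "DE"}
risk_score = aggregate_risk(sample_order, modules)
```

A weighted average is only one possible way of resolving conflicting verdicts; voting schemes or a learned meta-classifier would fit the same interface.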
5.1.2 Data mining for e-Sales
Many e-shops use the freely available Google Analytics tool to analyse and visualize relevant
metrics for controlling their e-shop activities. However, Google Analytics mainly monitors the
activities which lead the traffic into the e-shop, e.g. campaigns. Many e-shop owners don’t
monitor the activities on the e-shop very intensively to harvest the visitors who entered the
e-shop.
The fundamental idea behind the SME E-COMPASS online data mining services is to support
small e-shops in increasing their conversion rates from visitor to customer by improving the:
• understanding of the customers and their expectations/motivation,
• knowledge about competitors and their activities, especially concerning their prices
  and price trends,
• examination of potentials for improvements by analysing some selected information
  of both customers and competitors,
• initiation of appropriate actions depending on the identification of certain patterns in
  the above-mentioned analysis results.
In order to implement a solution which supports the above-mentioned features the following
five modules are developed:
a) Data collection and consolidation
b) Competitor price data collection
c) Business Scorecard – optimization potential analysis
d) Automated procedures by applying rule-based actions
e) Visualization – SME E-COMPASS cockpit
Figure 16: Data Mining SME E-COMPASS Architecture
a) Data collection and consolidation
An RDF repository integrates all required data from data sources in different formats and makes
them available to the services developed in the project. The data integration is done by using
RDF as the data model.
Integrating data from multiple heterogeneous sources entails dealing with different data
models, schemas and query languages. The data collection and integration also provides an
interface to the Web analytics metrics of the E-shops. An OWL ontology will be used as
mediated schema for the explicit description of the data source semantics, providing a shared
vocabulary for the specification of the semantics. In order to implement the integration, the
meaning of the source schemas has to be understood. Therefore, mappings are defined
between the developed ontology and the source schemas.
In case of data mining for e-sales, the RDF repository stores data about online transactions,
user registries and all data required and produced by the data mining algorithms as well as
third-party data, to produce integrated data using RDF as common language and the ontology
as common domain model (data schema). These integrated RDF data will be translated to a
format that data mining tools can understand (for example, ARFF as used in Weka platform) to
enable the analysis of the data.
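The last step, translating integrated data into a format that data mining tools understand, can be illustrated with a small ARFF writer. The sketch assumes that the RDF data have already been flattened into one record per visit (for instance by a SPARQL SELECT query); the function name `to_arff` and the attribute names are hypothetical.

```python
from typing import Dict, List, Sequence


def to_arff(relation: str, numeric_attrs: Sequence[str],
            nominal_attrs: Dict[str, Sequence[str]],
            rows: List[Dict[str, object]]) -> str:
    """Serialise flat records as an ARFF document (header plus @data section)."""
    lines = [f"@relation {relation}", ""]
    for name in numeric_attrs:
        lines.append(f"@attribute {name} numeric")
    for name, values in nominal_attrs.items():
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    columns = list(numeric_attrs) + list(nominal_attrs)
    for row in rows:
        # "?" is the ARFF marker for a missing value.
        lines.append(",".join(str(row.get(col, "?")) for col in columns))
    return "\n".join(lines) + "\n"


# Hypothetical visit records with one numeric and one nominal attribute.
arff_text = to_arff(
    "visits",
    numeric_attrs=["pages_viewed"],
    nominal_attrs={"ordered": ["yes", "no"]},
    rows=[{"pages_viewed": 4, "ordered": "no"}, {"ordered": "yes"}],
)
```

The sketch does not quote values; attribute names or values containing spaces or commas would need quoting before the file is loaded into a tool such as Weka.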
Furthermore, these integrated RDF data are connected with the data warehouse. Additionally,
data cleaning and filtering will be provided in order to consolidate the different data in the RDF
repository. This process will be supported by the ontology developed in WP2, and it will
consist of structuring keywords in raw data (filtering out noisy information), and hence providing
semantic meaning to these data.
Service features:
• Collection of different data: data from the own e-shop, data from competitors' e-shops,
  third-party data which could be relevant for the rest of the modules and which are
  available as open data, as well as the results of the different data-mining algorithms
• Semantic data cleaning and filtering
• Consolidation of all data in an RDF repository
• Query interface to recover data from the RDF repository
Technical service components:
• Data import from web analytics
• Virtuoso RDF database
• Translation to RDF from other format services
• Export interface to data warehouse
b) Competitor price data collection
In order to understand the performance of and certain trends in one's own e-shop, external
information is scraped from competitors' e-shops, such as prices on competitors’ product
pages and the terms-and-conditions pages that the e-shop owner has specified.
The observation of the competitors can be made at a general level, i.e. the e-shop level, where the
intensity of the presence and activities of the competitors in social media could be measured.
When focusing on the product level, concrete products and services could be compared, e.g.
price variations or other structural variables such as terms and conditions.
For those products that are not comparable by a concrete identifier, the e-shop owners can
specify any product which is price-relevant for their own product(s). Thus, the service is able to
provide price data or price changes for analysing the dynamics of prices.
Service features:
• Scraping of product prices (own/competitors) on a daily basis by
  o identifying the appropriate product pages on the web site of the competitors
    with a high degree of automation
  o analysing and properly identifying the relevant information on the pages, e.g.
    product name, product ID, product prices, and product availability, before
    their scraping.
• Checking for changes in any other relevant offerings (i.e. in shipping conditions of
  competitors, return conditions, general terms & conditions) on a daily basis (check
  whether changes have occurred, i.e. not the analysis of the content/changes)
• Provides competitors’ price information for the Business Scorecard (analysis) and
  delivers the price information in form of RDF to the RDF repository.
Technical service components:
• Product price (own/competitors) scraper for e-shops
• External VPN service
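The daily check for changes in terms and conditions only needs to establish whether a page has changed, not what has changed. One simple way to do this, sketched below under that assumption, is to store a hash of the page text and compare it with the next day's hash. Page retrieval is left out; the function names and the in-memory `store` dictionary are hypothetical stand-ins for the scraper and its persistence layer.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(page_text: str) -> str:
    """SHA-256 hash of the page text after collapsing runs of whitespace."""
    normalised = " ".join(page_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(url: str, page_text: str, store: Dict[str, str]) -> Optional[bool]:
    """Compare the current fingerprint with the stored one and update the store.

    Returns None the first time a page is seen, otherwise True or False.
    """
    new = fingerprint(page_text)
    old = store.get(url)
    store[url] = new
    if old is None:
        return None
    return new != old


# Hypothetical terms page observed on three consecutive days.
store: Dict[str, str] = {}
url = "https://shop.example/terms"
day1 = has_changed(url, "Free  shipping\n over 50 EUR", store)
day2 = has_changed(url, "Free shipping over 50 EUR", store)
day3 = has_changed(url, "Free shipping over 40 EUR", store)
```

Whitespace is normalised so that purely cosmetic reflowing of the text does not raise an alert; dynamic page elements such as timestamps would need to be stripped before hashing.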
c) Business Scorecard – optimization potential analysis
SME E-COMPASS offers a visitor segmentation service using behaviour-based clustering
techniques. For a specified period of time the e-shop owner will be able to examine the
demand/motivation of his/her visitors based on categories of their behaviour. For example,
visitor clusters could be defined, such as explorers (searching deeply in an e-shop and its
content with a strong focus), clueless (searching without an identifiable focus), buyers (placing
an order), etc.
In order to identify the motivation of visitors for entering the e-shop, the visitor clusters are
built on the basis of information about visitor behaviour, such as opened pages, time of stay,
bounce rate, search terms which have been entered in the external or internal search engine,
etc. The benefit of understanding the visitors’ motivation is the ability to develop target-specific sales
strategies.
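As an illustration of behaviour-based clustering, the sketch below applies plain k-means to two behavioural features per visitor. The choice of k-means, the two features (pages opened, minutes of stay) and the sample values are assumptions made for the example; the framework does not fix a particular clustering algorithm.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], initial_centroids: Sequence[Point],
           max_iterations: int = 100) -> Tuple[List[int], List[Point]]:
    """Plain k-means (Lloyd's algorithm) with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment: List[int] = []
    for _ in range(max_iterations):
        # Assignment step: each visitor goes to the nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no visitor changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


# Hypothetical visitors described by (pages opened, minutes of stay).
visitors = [(1, 0.5), (2, 1.0), (1, 1.0), (12, 10.0), (14, 12.0), (13, 11.0)]
clusters, centres = kmeans(visitors, initial_centroids=[(0, 0), (10, 10)])
```

In practice the features would be put on comparable scales before clustering, and the resulting clusters would still need to be interpreted and named (e.g. as explorers or buyers) by the analyst.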
The observations are made along a spatial and a temporal axis, i.e. the behaviour is
examined per market, disaggregated by place (such as country and city) and by
time.
From the collected metrics, correlations are derived, e.g. a reduction in orders or in the number of
visitors of a specific product in one's own e-shop due to a price reduction by a relevant
competitor in his/her e-shop. In this case, internal metrics are combined with external metrics
in order to provide new insights.
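A minimal way to combine an internal and an external metric is the Pearson correlation coefficient between two daily series. The sketch below assumes this measure; the series names and values are invented for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily series: competitor price (external) and own orders (internal).
competitor_price = [20.0, 19.0, 18.0, 17.0, 16.0]
own_orders = [10, 9, 8, 7, 6]
r = pearson(competitor_price, own_orders)
```

A coefficient close to +1 in this example would mean that own orders fall as the competitor lowers the price. A correlation alone does not establish that the price cut caused the drop, so such findings are best treated as hints for further analysis.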
The ultimate goal is to improve service performance and increase sales. After analysing the
internal and external metrics, potential improvements to the activities of an e-shop should be
identified.
Service features:
• Storing variables or metrics which are provided by the module of data collection and
  consolidation for the data warehouse
• Data processing to carry out the data quality assurance
• Applying data mining techniques to analyse the cleaned and stored data
• Generate the Business Scorecard
Technical service components:
• Data quality assurance
• Data warehouse
• Business intelligence (BI) service and Business Scorecard (BSc)
d) Automated procedures by applying rule-based actions
Additionally, a rule-based solution is built to check the collected data for defined patterns in
order to conduct actions which improve the e-sales activities of an e-shop. The rules are
executed on the business figures of the BI service or the semantically connected data of the
RDF repository.
The challenges of the service are to define a good set of predefined rules and actions for the
users and the identification of relevant data from the knowledge bases (BI service and RDF
repository).
Service features:
• Configuration API for import of configuration parameters from ECC
• Presets of rules which facilitate the identification of certain highly relevant
  patterns within the collected data
• Development of presets of defined actions, such as alerts, notifications, etc.
Technical service components:
• Rules engine (e.g. event-condition-action (ECA) rule engine or logic-based rules
  engine)
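The event-condition-action idea can be sketched as follows. This is a minimal illustration rather than a proposal for a specific rules engine: the class `Rule`, the function `process`, the event type and the payload fields are all hypothetical, and the action merely builds a notification text instead of sending a mail or alert.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Rule:
    event_type: str                      # E: the kind of event the rule listens to
    condition: Callable[[Event], bool]   # C: predicate on the event payload
    action: Callable[[Event], str]       # A: here, builds a notification text


def process(event: Event, rules: List[Rule]) -> List[str]:
    """Run every rule whose event type matches and whose condition holds."""
    notifications = []
    for rule in rules:
        if rule.event_type == event.get("type") and rule.condition(event):
            notifications.append(rule.action(event))
    return notifications


# Hypothetical preset: alert when a competitor undercuts the own price.
undercut_rule = Rule(
    event_type="competitor_price_change",
    condition=lambda e: e["new_price"] < e["own_price"],
    action=lambda e: f"Competitor undercuts {e['product']}",
)
alerts = process(
    {"type": "competitor_price_change", "product": "P-1",
     "new_price": 9.5, "own_price": 10.0},
    [undercut_rule],
)
```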
e) Visualization – SME E-COMPASS cockpit
The SME E-COMPASS cockpit (ECC) provides the user interface of all above-described services.
It is the single point of contact for the e-shop owner and the place where he or she is able to set up
the configurations of all SME E-COMPASS services, e.g. the competitor analysis and the rule-based actions. The ECC also presents all information generated by the SME E-COMPASS
services to the e-shop owner in an appropriate and understandable form and visualization.
Challenges in the implementation lie in the integration of the ECC with all other services in a
bidirectional way, i.e. the user can control and configure the attached services via the ECC and
gets alerts, notifications, KPIs, competitor information and statistics as well as historical data of
interest via the cockpit. As the ECC provides the single point of contact and information for all
E-COMPASS customers, the cockpit needs to be a multi-tenant service with strict separation of
customer data. To be easy to use for the target group of small and medium enterprises,
it is fully web-based. The implementation should allow for easy scalability with increasing
numbers of customers.
Service features:
• Control and configuration of all other SME E-COMPASS services within the range of
  their provided features
• Single point of information for all other SME E-COMPASS services
• Visualization of data analysis results
• Display of alerts and notifications for necessary or recommended actions to be
  carried out by the E-Shop owner
Technical service components:
• Multi-tenant web portal
• Dashboard visualization components
• Graphic visualization service that prepares incoming information from other SME
  E-COMPASS services for display in the cockpit
• APIs for service control and information import
For each of the five modules there are several different technical implementations possible. In
the following the different options for the modules will be sketched.
a) Data collection and consolidation
The implementation of the data collection services depends very much on the current
situation of the E-Shop that is to use the E-COMPASS data mining services. There is a high
probability that many E-Shops will already use Google Analytics as web analysis solution.
However, tools such as Piwik might be another option that needs to be taken into account. The
user requirements analysis will provide the information on which a final decision can be made.
A second source of information for the data mining services will be the competitor information
on prices as well as terms and conditions. Both data sources will be connected to a data
cleaning and consolidation service that is to be developed within the E-COMPASS project and
which will feed consolidated web analysis and competitor scraping data into an RDF repository
based on the Virtuoso RDF database.
b) Competitor price data collection
In order to collect competitor pricing data as well as changes in shipping conditions, delivery
conditions and general terms and conditions, a web scraping engine will be used. The web
scraping engine will need to allow for configuration via the E-COMPASS cockpit. One possible technical
solution for web scraping may be Arcane, an engine developed at Fraunhofer IAO. The
user interface component for defining the scraping target, i.e. the competitor E-Shop’s product
and terms web pages, will need to be integrated into the E-COMPASS cockpit, which will be
based on web portal technology such as Liferay.
c) Business Scorecard – optimization potential analysis
In order to implement the core of the data mining services, a second step of data quality
assurance will be developed within the project. It will also take care of extracting all the
relevant information for the business intelligence services and inserting it into a data
warehouse via ETL (extract-transform-load). Data warehouse solutions are available from a
wide range of renowned software vendors. The data mining solutions will access the
data warehouse and process the data available there in order to derive the key performance
indicators to be displayed in the business scorecard service; sections 4.5.4 and 4.6 give an
overview of the possible commercial and open source solutions. The final decision for a
specific solution will take the results of the user requirements analysis into account.
d) Automated procedures by applying rule-based actions
There are several technological possibilities to implement a system that is capable of carrying
out rule-based actions as a reaction to correlations found by the E-COMPASS data mining
modules. Depending on the size of the data sets, the number of parameters and the
reaction timescale needed by SME E-Shop owners, several solutions may be taken into account.
For real-time processing of large-scale event messages, a message
queuing system in combination with a complex event processing engine might be necessary.
However, the indicators available so far show that a lighter version of business rules engines
might be more appropriate, as the events on which actions need to be taken within E-COMPASS
tend to be changes in correlations between the overall user behaviour in one's own E-Shop
and the information on competitor price trends. As this will probably not produce
large-scale event data, the simpler concept of event-condition-action (ECA) engines will
probably be sufficient. Depending on the intrusiveness that E-Shop owners are willing to
accept (results will be available from the user requirements analysis), a connection of the ECA
engine to a mailing server and the ECC will be sufficient.
e) Visualization – SME E-COMPASS cockpit
The central component for user interaction will be the E-COMPASS cockpit. The basis will be
web portal technology such as the open source solution Liferay which offers the option of
commercial support as well. This can be combined with dashboarding technologies and allows
for a relatively easy integration of different web-based data analysis components, including the
user interface for the web scraping.
5.1.3 Semantic Web Integration
Semantic web technologies will be used in the project to enable the integration of the data
and the interoperability of the developed algorithms.
Linked Data
Linked Data will be used in the project to retrieve parts of the information required by the
different algorithms. An RDF repository will be developed using Virtuoso as the RDF database and
an OWL ontology (specifically developed for the project's requirements) as the common data
model. This repository will be queried by means of SPARQL queries.
Ontologies
The ontology that will cope with the data representation needs of the project will be
developed following the METHONTOLOGY methodology (Fernandez et al, 2007) and using Web Protégé
(Tudorache et al, 2008) for the collaborative improvement of the semantic model.
Web ontology languages
The project ontology will be written in OWL as ontology definition language. Mapping between
the ontology and the data sources will generate RDF triples. Therefore, instances of the
ontology will be RDF triples.
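The mapping from a source schema to RDF triples can be pictured with a small example that serialises one row of a relational orders table as N-Triples. The namespaces, the class `Order`, the properties `hasCustomer` and `totalAmount` and the column names are hypothetical placeholders; the actual vocabulary will be defined by the project ontology.

```python
from typing import Dict, List

# Hypothetical namespaces, used for illustration only.
ONTOLOGY = "http://example.org/ecompass/ontology#"
DATA = "http://example.org/ecompass/data/"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def order_to_ntriples(row: Dict[str, str]) -> List[str]:
    """Map one row of a (hypothetical) orders table to N-Triples statements."""
    subject = f"<{DATA}order/{row['order_id']}>"
    return [
        # The row becomes an instance of the ontology class Order.
        f"{subject} <{RDF_TYPE}> <{ONTOLOGY}Order> .",
        # A foreign key becomes a link to another resource.
        f"{subject} <{ONTOLOGY}hasCustomer> <{DATA}customer/{row['customer_id']}> .",
        # A plain column becomes a typed literal.
        f'{subject} <{ONTOLOGY}totalAmount> "{row["total"]}"^^<{XSD_DECIMAL}> .',
    ]


triples = order_to_ntriples({"order_id": "1001", "customer_id": "42", "total": "59.90"})
```

The sketch performs no escaping, so it assumes identifiers and values that are safe to embed in IRIs and literals.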
5.2 Objectives
The following paragraphs present the scientific and technological objectives of the SME E-COMPASS methodological framework, structured per application.
5.2.1 Anti-Fraud System’s Objectives
The scientific and technological objectives for the anti-fraud system that will be designed and
developed in the context of the project and to be used by European SMEs are the following:
1. Extracting common fraudulent behaviours. Our aim is to analyze large volumes of data
already available at online shops and extract the principal components characterising
fraud activity (i.e. those transaction attributes that convey important information to
the fraud analyst). Through networking actions with domain experts, we hope to
facilitate the exchange of knowledge and best practices in fraud management, an area in
which no significant progress has been made in recent years (see
section 3.5.3).
2. Disseminating novel patterns of cybercriminal activity. The processing of up-to-date
transaction data will allow us to extract and subsequently disseminate to online
merchants possibly new tactics that cybercriminals have developed to commit
payment fraud.
3. Developing hybrid system architectures. The smart blending of fraud detection
techniques is currently gaining much attention in the literature, as a way of
overcoming the deficiencies of individual state-of-the-art technologies and
addressing the peculiarities of the fraud detection application domain. This is also the
approach adopted by the SME E-COMPASS project. We are aiming at experimenting
with different levels of hybridization, for instance
a) combining supervised learning with anomaly detection techniques
b) using intelligent optimization heuristics to fine-tune the parameters of fraud
detectors on non-standard performance metrics or
c) using rule-inductive algorithms to facilitate the interpretation of less
transparent classification models.
We particularly favour the use of nature-inspired intelligent algorithms, such as
particle swarm optimization, differential evolution and artificial immune systems, as
standalone detectors or as part of a hybrid transaction-monitoring system. All the
afore-mentioned technologies will form an integral part of the knowledge database,
the “brain” of the fraud detection system.
4. Improving the readability of the automated fraud detection process. Through the use
of symbolic data mining techniques, we aim at offering simple guidelines, in the form
of association rules or decision trees, which experts can utilise to evaluate online
transactions or to sketch the profile of fraudsters.
5. Creating an adaptive fraud-detection framework. As analysed in section 3.5.2, a big
challenge for anti-fraud systems is how to cope with a dynamic business
environment, where fraud and normality definitions change over time. Attaining a
good level of adaptivity is also a priority goal for our proposed architectures.
6. Improving the cost-efficiency of the overall fraud detection process. Cost-efficiency
requires a holistic view of fraud detection taking into account the cost of both
manual and mechanical operations (see section 3.5.8). In the context of SME E-COMPASS, we will address this issue by providing economically-optimal design
parameter settings for fraud monitoring systems supplemented by cost-effective
practices for manual reviewing.
7. Exploitation of cross-sectoral data and global information sources. One of the goals of
the SME E-COMPASS is the development of a Transaction Analytics Toolkit (TAT)
that will facilitate fraud detection by highlighting technical and geospatial aspects of
each transaction. Through the development of TAT we aim at streamlining traditional
risk monitoring practices (such as manual examination of credit card details or client
profile) and also promoting the efficient usage of publicly available cross-sectoral
data and global information sources.
8. Software-as-a-service application. Our vision is to create a web service through which
various merchants and fraud professionals can screen online transactions and gain
extra knowledge on current cyberfraud practices. This is expected to have minimal
requirements for “in-house” computational resources and technical expertise. An
integral part of the anti-fraud service will be the reputation database which will
include fraud indicators in the form of classification or scoring rules, discussed in
section 3.2.2.
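As an illustration of objective 6, the following minimal sketch shows how an alert threshold could be chosen so that the combined cost of manual reviews and missed fraud is as small as possible. It is not the project's method: the function names, the flat review cost and the assumption that every reviewed fraud is caught are all hypothetical simplifications.

```python
from typing import List, Sequence


def total_cost(scores: Sequence[float], is_fraud: Sequence[bool],
               amounts: Sequence[float], threshold: float,
               review_cost: float) -> float:
    """Cost of flagging every transaction whose score is >= threshold.

    Assumption: a flagged transaction costs one manual review and, if it is
    fraudulent, the review always catches it. An unflagged fraudulent
    transaction costs its full amount.
    """
    cost = 0.0
    for score, fraud, amount in zip(scores, is_fraud, amounts):
        if score >= threshold:
            cost += review_cost
        elif fraud:
            cost += amount
    return cost


def best_threshold(scores: Sequence[float], is_fraud: Sequence[bool],
                   amounts: Sequence[float], review_cost: float) -> float:
    """Return the candidate threshold with the lowest total cost."""
    # Candidates: every observed score, plus "flag nothing".
    candidates: List[float] = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(scores, is_fraud, amounts, t, review_cost))
```

In practice the labelled history, the review cost and the loss per missed fraud would come from the merchant's own records; the sketch only shows the shape of the optimisation.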
5.2.2 Objectives – Online data mining
The following paragraphs present the technological objectives for the online data mining
services. They have been derived from the study of current trends and practices, adapted to the
operational environment of European SMEs active in e-commerce markets.
1. Collection of data from various data sources and its consolidation. Our aim is to collect
relevant data from various internal and external data sources of an e-shop, e.g.
information of customer behaviour, customer attributes, the e-shop, competitors and
their products. For the data to be analysed, they need to be consolidated
and made interpretable. To this end, an elementary data model which includes the relevant
concepts and their relationships is developed and considered as a basis for
implementing the business intelligence algorithms.
2. Collection of information on competitors and their products. For small e-shops, not only
the internal view of the e-shop (e.g. content and navigation structure) and its visitors
plays an important role; external aspects, e.g. information about competitors and
prices, are also crucial to monitor when trying to optimize one's own e-shop. Therefore, SME
E-COMPASS develops mechanisms which enable e-shop owners to identify and
collect relevant information about competitors on the Web, such as product prices. Those
mechanisms are integrated in the SME E-COMPASS cockpit (ECC) and made available to
the other modules of the online data mining service.
3. Business Scorecard – optimization potential analysis. Current Web analytics solutions
base their analyses on the data which are collected in the context of the e-shop. The
interpretation and visualization of the numerous different types of data is quite
complicated and has to be done by the e-shop owners themselves unless they are
willing to pay for an advisor. Therefore, we aim to develop a target-group-specific
Business Scorecard which provides owners of small e-shops with new insights
into their activities and an overview of new optimization potentials by analysing
internal and external data from various sources in addition to the existing web
analytics information. In this context, data mining techniques are applied in order
to gain new insights from the enlarged data source.
4. Automated procedures by applying rule-based actions. For owners of small
e-shops, monitoring all crucial internal and external metrics usually becomes complex.
In order to facilitate the monitoring of relevant metrics and certain patterns
(e.g. a competitor reduces the price of a specific product and the number of own visitors
or even buyers is decreasing), a rule-based solution is designed and implemented. It
additionally allows the definition of automated actions which are initiated when certain
situations (recognized patterns within the enlarged data source) occur. The actions
need to be defined in workshops with the target group of owners of small e-shops.
5. Visualization of the results in the SME E-COMPASS cockpit. The SME E-COMPASS
cockpit is designed to allow the configuration of the services (e.g. which competitors
need to be observed and which products are relevant) and to present the BI results of
the different analyses. The cockpit features defined interfaces (APIs) which
allow the exchange of information from the cockpit to the different service modules
which are implemented.
6. Software-as-a-service application. Similar to the anti-fraud use case, our vision for the
online data mining services is to create a web-based service which provides the
additional features, information and results to the owners of small e-shops. The SME
E-COMPASS cockpit is intended to be used alongside the web analytics tools that
e-shop owners already apply.
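The rule-based monitoring described in objective 4 can be sketched as follows. This is only an illustration under assumed names: the `Rule` structure, the metric keys `competitor_price` and `visitors`, and the notification text are hypothetical, and the actual rules are to be defined in workshops with e-shop owners.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Rule:
    name: str
    # Decides, from the previous and current metric snapshots, whether the rule fires.
    condition: Callable[[Metrics, Metrics], bool]
    # Produces a description of the automated action to initiate.
    action: Callable[[Metrics], str]


def evaluate(rules: List[Rule], previous: Metrics, current: Metrics) -> List[str]:
    """Return the actions of all rules whose condition holds."""
    return [rule.action(current) for rule in rules
            if rule.condition(previous, current)]


# Example pattern from the text: a competitor lowers a price while
# the number of own visitors decreases.
price_drop_rule = Rule(
    name="competitor_price_drop",
    condition=lambda prev, cur: (cur["competitor_price"] < prev["competitor_price"]
                                 and cur["visitors"] < prev["visitors"]),
    action=lambda cur: f"notify owner: competitor price now {cur['competitor_price']:.2f}",
)
```

Keeping conditions and actions as separate callables means new patterns can be added without changing the evaluation loop.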
5.3 Integration Framework for the Design Process
The overarching integration task in the project is to develop an RDF repository which integrates all
required data from data sources in different formats and makes them available to the services
developed in the project (anti-fraud and data mining for e-sales). The repository
uses RDF as its data model. Figure 17 depicts how the
repository is integrated with the two service applications.
Integrating data from multiple heterogeneous sources entails dealing with different data
models, schemas and query languages. An OWL ontology will be used as a mediated schema for
the explicit description of the data source semantics, providing a shared vocabulary for the
specification of the semantics. In order to implement the integration, the meaning of the
source schemas has to be understood. Therefore, we will define mappings between the
developed ontology and the source schemas.
In the case of the online fraud application, the aim of the RDF repository is to make data from
data sources in different formats available to the anti-fraud algorithms. Data translators from RDF
to other formats will be developed when necessary, enabling the interchange of data among
algorithms dealing with different data models. The results of the algorithms will also be stored in
the RDF repository to make them available to the other algorithms.
In the case of data mining for e-sales, the RDF repository stores data about online transactions and
user registries, to produce integrated data. These integrated RDF data will be translated to a
format that data mining tools can understand (for example, ARFF as used in the Weka platform) to
enable the analysis of the data.
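The translation from RDF to ARFF mentioned above can be illustrated with a minimal sketch. It assumes that the RDF data are already available as plain (subject, predicate, object) string triples with short predicate names; a real translator would read them from the repository and map predicate IRIs to attribute names. The function names are hypothetical, and values containing spaces or commas would additionally need quoting.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def triples_to_rows(triples: Iterable[Triple]) -> Dict[str, Dict[str, str]]:
    """Pivot triples into one attribute dictionary per subject."""
    rows: Dict[str, Dict[str, str]] = defaultdict(dict)
    for subject, predicate, obj in triples:
        rows[subject][predicate] = obj
    return dict(rows)


def to_arff(relation: str, rows: Dict[str, Dict[str, str]],
            numeric: Set[str]) -> str:
    """Serialise the rows as ARFF text; missing values are written as '?'."""
    attributes = sorted({p for row in rows.values() for p in row})
    lines = [f"@relation {relation}", ""]
    for attr in attributes:
        if attr in numeric:
            lines.append(f"@attribute {attr} numeric")
        else:
            # Nominal attribute: list the observed values.
            values = sorted({row[attr] for row in rows.values() if attr in row})
            lines.append(f"@attribute {attr} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for subject in sorted(rows):
        lines.append(",".join(rows[subject].get(attr, "?") for attr in attributes))
    return "\n".join(lines)
```

Each RDF subject (for example one transaction) becomes one ARFF data row, and each predicate becomes one attribute.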
Figure 17 The RDF repository and its relations with the Project Work Packages
6 APPENDIX
6.1 Web analytics techniques (for visitor behaviour analysis)
E-shop owners can apply various methods and techniques to conduct web analytics:
1. Web server logfile analysis: web servers record some of their transactions in a logfile
which can be read and analysed with respect to certain attributes of e-shop visitors.
Initially, web site statistics consisted primarily of counting the number of client
requests (or hits) made to the web server. For HTML files without images, this was a
reasonable measure. However, with the introduction of images in HTML, and web sites
that spanned multiple HTML files, this count became less useful since opening one
HTML file caused an undefined number of requests.
Therefore, the two measures of page views and visits (or sessions) were
introduced. A page view was defined as a request made to the web server for a page,
as opposed to a graphic, while a visit was defined as a sequence of requests from a
uniquely identified client that expired after a certain amount of inactivity, usually 30
minutes. Page views and visits are still commonly displayed metrics.
The emergence of search engine spiders and robots, along with web proxies and
dynamically assigned IP addresses for large companies and ISPs, made it more difficult
to identify unique human visitors to a web site. Log analysers responded by tracking
visits by cookies, and by ignoring requests from known spiders.
The extensive use of web caches also presented a problem for logfile analysis. If a
person revisits a page, the second request will often be retrieved from the browser's
cache, and so no request will be received by the web server. This means that the
person's path through the site is lost. Caching can be defeated by configuring the web
server, but this can result in degraded performance for the visitor and bigger load on
the servers.
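The page view and visit definitions given above can be sketched in a few lines. The example assumes that the logfile has already been parsed into (client, timestamp, path) records and that a page is recognised by its path suffix; both the record layout and the suffix list are illustrative assumptions, not part of any particular log analyser.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

SESSION_TIMEOUT = timedelta(minutes=30)   # inactivity after which a visit expires
PAGE_SUFFIXES = (".html", ".htm", "/")    # assumed: everything else is a graphic etc.

Request = Tuple[str, datetime, str]       # (client_id, timestamp, path)


def is_page(path: str) -> bool:
    """A request counts as a page view only if it targets a page, not a graphic."""
    return path.lower().endswith(PAGE_SUFFIXES)


def count_page_views_and_visits(requests: Iterable[Request]) -> Tuple[int, int]:
    """Return (page_views, visits) for a collection of parsed log records."""
    page_views = 0
    visits = 0
    last_seen: Dict[str, datetime] = {}
    for client, timestamp, path in sorted(requests, key=lambda r: r[1]):
        if is_page(path):
            page_views += 1
        previous = last_seen.get(client)
        # A new visit starts on the first request of a client or after
        # more than 30 minutes of inactivity.
        if previous is None or timestamp - previous > SESSION_TIMEOUT:
            visits += 1
        last_seen[client] = timestamp
    return page_views, visits
```

The sketch takes the client identifier as given; as discussed above, identifying a unique client reliably is the hard part in practice.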
Identification of recurring visitors
In order to keep track of a user’s activities on a specific website, small pieces of text,
i.e. cookies, are transmitted by the web server to the web browser. The visitor's
browser stores the cookie information on the hard drive so when the browser is closed
and reopened at a later date, the cookie information is still available. These are known
as persistent cookies. Cookies that only last for a visitor's session are known as session
cookies.
By applying cookies, an e-shop is able to anonymously identify users for later use,
most often by means of a visitor ID number. By analysing the cookies, the e-shop can determine
how many first-time or recurring visitors a site has received, how many times a
visitor returns in each period and how much time passes between visits. By
identifying a certain visitor, a web server can present visitor-specific web pages, i.e. a
recurring visitor may be presented with different content than a first-time visitor. If the
visitors register and log in to an e-shop, further cookie information may be used to
personalise the information presented in the e-shop.
Two types of cookies are differentiated: first-party and third-party. A first-party cookie
is created by the web site someone is currently viewing. A third-party cookie is sent
from a web site different from the one someone is currently visiting. The major idea is
that the transfer of cookie information takes place behind the scenes without the user
having to know/worry about it. However, this does mean cookies have implications
which are relevant to a user's privacy and anonymity on the web.
From a web analytics point of view, cookie information is crucial. Since many anti-spy programs and firewalls block third-party cookies, e-shop owners
should only apply first-party cookies; otherwise the collected analytics
data will be distorted. End-users are also becoming much more 'cookie savvy' and will delete cookies
manually or configure their browsers to reject third-party cookies automatically.
Recent studies have indicated that as many as 30% of users delete cookies within 30
days. Firefox defaults to a limit of 50 cookies per site and 1000 total.
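The difference between persistent and session cookies, and the recognition of recurring visitors by a visitor ID, can be illustrated with Python's standard `http.cookies` module. The cookie name `visitor_id`, the one-year lifetime and the function name are assumptions made for this sketch only.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Optional, Tuple


def issue_visitor_cookie(cookie_header: Optional[str],
                         persistent: bool = True,
                         max_age_days: int = 365) -> Tuple[str, Optional[str]]:
    """Return (visitor_id, Set-Cookie value or None).

    If the request already carries a visitor_id cookie, the visitor is a
    recurring one and no new cookie is issued. Otherwise a random,
    anonymous ID is generated.
    """
    received = SimpleCookie(cookie_header or "")
    if "visitor_id" in received:
        return received["visitor_id"].value, None

    visitor_id = uuid.uuid4().hex
    cookie = SimpleCookie()
    cookie["visitor_id"] = visitor_id
    cookie["visitor_id"]["path"] = "/"
    if persistent:
        # A Max-Age attribute makes the cookie survive a browser restart;
        # without it the browser treats it as a session cookie.
        cookie["visitor_id"]["max-age"] = max_age_days * 24 * 60 * 60
    return visitor_id, cookie["visitor_id"].OutputString()
```

Because the cookie is set by the e-shop's own server for its own domain, it is a first-party cookie in the sense described above.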
Fingerprinting techniques are an alternative to cookies. In this case, a great variety of
technical information about a user’s IT environment is gathered, e.g. provider, screen
resolution and installed plugins, and aggregated into an individual profile. Inaccuracies occur
when visitors change their hardware or software, e.g. delete or add plugins, or when other
users have a similar individual profile/fingerprint.
Figure 18: Techniques applied for recognizing recurring visitors (Bauer et al., 2011)
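The fingerprinting idea described above, including its two sources of inaccuracy, can be sketched as a hash over the collected attributes. The attribute names and the choice of SHA-256 are assumptions for illustration; real fingerprinting solutions collect many more signals.

```python
import hashlib
import json
from typing import Any, Dict


def fingerprint(attributes: Dict[str, Any]) -> str:
    """Aggregate technical attributes into a single profile identifier."""
    # Sort list-like values (e.g. plugins) so that their order does not matter.
    normalised = {
        key: sorted(value) if isinstance(value, (list, tuple, set)) else value
        for key, value in attributes.items()
    }
    canonical = json.dumps(normalised, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two visitors with identical attributes receive the same fingerprint, and a single visitor who adds or removes a plugin receives a new one, which are exactly the inaccuracies noted in the text.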
2. Page tagging: Concerns about the accuracy of logfile analysis when browsers apply
caching techniques, and the requirement to offer web analytics as a cloud
service, led to the emergence of the second data collection method, page tagging or 'web bugs'.
In the past, web counters, i.e. images included in a web page that showed the number
of times the image had been requested as an estimate of the number of visits to that page, were
commonly used. Later on, a small invisible image was used together with JavaScript to pass
along certain information about the page and the visitor with the image request. This
information can then be processed and visualized by a web analytics service.
The web analytics service also needs to process a visitor’s cookies, which allow a
unique identification during his visit and in subsequent visits. However, cookie
acceptance rates significantly vary between Websites and may affect the quality of
data collected and reported.
Collecting web site data by applying third-party cookies and a third-party data
collection server requires an additional DNS look-up by the visitor's computer to
determine the IP address of the collection server. In this case, delays in completing
successful or failed DNS look-ups may occasionally result in data not being collected.
With the increasing popularity of Ajax-based solutions, an alternative to the use of an
invisible image is to implement a call back to the server from the rendered page. In
this case, when the page is rendered on the web browser, a piece of Ajax code would
call back to the server and pass information about the client that can then be
aggregated by a web analytics service. This approach is limited to some extent by browser
restrictions on the servers which can be contacted with XMLHttpRequest objects. Also,
this method can lead to slightly lower reported traffic levels, since the visitor may stop
the page from loading in mid-response before the Ajax call is made.
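The page tagging mechanism described above can be sketched from the point of view of the collection server: the tag on the page requests an invisible image and encodes information about the page and the visitor in the query string, which the server then decodes. The collector URL and the parameter names `page`, `ref` and `screen` are hypothetical.

```python
from typing import Dict
from urllib.parse import parse_qs, urlencode, urlsplit


def build_tag_url(collector: str, page: str, referrer: str, screen: str) -> str:
    """URL that the page tag would request (normally assembled by JavaScript)."""
    query = urlencode({"page": page, "ref": referrer, "screen": screen})
    return f"{collector}?{query}"


def parse_tag_request(url: str) -> Dict[str, str]:
    """Decode the information passed along with the image request."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per parameter; each parameter is sent once here.
    return {name: values[0] for name, values in query.items()}
```

A web analytics service would store each decoded record together with the visitor's cookie and the time of the request, and answer the request with the invisible image.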
Hybrid methods
Some companies produce solutions that collect data through both logfiles and page tagging
and can analyse both kinds. By using a hybrid method, they aim to produce more accurate
statistics than either method on its own.
6.2 Metrics for customer behaviour analysis
Table 11 lists the metrics which are used by the Web Analytics
Association (Web Analytics Association, 2008).
Building blocks:
    Page
    Page View
    Visits (Sessions)
    Unique Visitors
    Event
Visit Characterization Terms:
    Entry Page
    Landing Page
    Exit Page
    Visit Duration
    Referrer
    Page Referrer
    Session Referrer
    Click-through
    Click-through Rate/Ratio
Visitor Characterization:
    New Visitor
    Return(ing) Visitor
    Repeat Visitor
    Visitor Referrer (Original Referrer or Initial Referrer)
    Visits per Visitor
    Recency
    Frequency
Engagement Terms:
    Page Exit Ratio
    Single Page Visits (Bounces)
    Bounce Rate
    Page Views per Visit
Conversion Terms:
    Conversion
    Conversion Rate
Miscellaneous Terms:
    Hit (AKA Server Request or Server Call)
    Impressions
Table 11: web Analytics Metrics by the web Analytics Association (Web Analytics Association, 2008)
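A few of the listed metrics can be computed directly once visits have been identified. The sketch below assumes that each visit is available as a record with a visitor ID, the list of pages viewed and a conversion flag, and that bounce rate and conversion rate are both taken relative to the number of visits; other denominators (e.g. unique visitors) are equally possible.

```python
from typing import Any, Dict, List


def engagement_metrics(visits: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute basic engagement and conversion metrics.

    Each visit is a dict with the keys 'visitor' (ID), 'pages' (list of
    viewed pages) and 'converted' (bool).
    """
    total = len(visits)
    if total == 0:
        raise ValueError("at least one visit is required")
    page_views = sum(len(v["pages"]) for v in visits)
    bounces = sum(1 for v in visits if len(v["pages"]) == 1)   # single page visits
    conversions = sum(1 for v in visits if v["converted"])
    return {
        "visits": total,
        "unique_visitors": len({v["visitor"] for v in visits}),
        "page_views_per_visit": page_views / total,
        "bounce_rate": bounces / total,
        "conversion_rate": conversions / total,
    }
```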
Table 12 lists the metrics considered by ibi research (Bauer et al., 2011). Since ibi research
considers additional and partly different metrics when discussing web analytics,
its metrics are introduced here as well.
ibi research
Information on the visitors’ origin (Visit Characterization Terms):
    Most common entry pages
    Websites which refer the visitors to the eShop
    Common key words which are used in search engines by the visitors of an eShop
    Common search phrases which are used in search engines by the visitors of an eShop
    Geographical origin of the visitors (e.g. country, region, town)
Information on visitors’ attributes (Visitor Characterization):
    Number of new visitors
    Number of recurring visitors
    Number of visits per visitor (visitor loyalty)
    Number of visitors per week
    Technical equipment of the visitors (e.g. browser version)
Information on visitors’ behaviour (Engagement Terms):
    Most common exit pages
    Most common page view sequence (click paths)
    Number of page views per visit (depth of visits)
    Pages which are most often viewed
    Time of stay per visit (duration of visits)
    Key words applied within the eShop's own search
Information on purchasing behaviour (Conversion Terms):
    Number of visitors who made a purchase
    Average value of a shopping cart of the eShop
    Number of visitors who put a product into the basket
    Number of visitors who abandon the checkout (purchasing) process
    Average time of stay in the eShop until purchasing products
    Average number of clicks until purchasing products
New functions for monitoring:
    Analysis of the access of mobile devices
    Categorization of user groups (visitor segmentation)
    Qualitative user survey
    Page-oriented user feedback
    Form field analysis
    Comparative tests (e.g. A/B tests)
    Mouse-tracking
    Analysis of user behaviour for videos
Table 12: web Analytics Metrics by ibi research (Bauer et al., 2011)
These metrics can be enhanced by applying data mining techniques and enriched by
mapping them to other valuable data, e.g. an IP address can be translated into the region from which a
visitor comes, or the content of a page from which an e-shop is frequently exited can be
extracted and analysed for optimization potential.
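The enrichment of a metric with external data can be illustrated with the IP address example. The sketch uses Python's standard `ipaddress` module and a tiny hand-made lookup table; the network ranges are reserved documentation ranges and the region names are placeholders, whereas a real service would rely on a maintained geolocation database.

```python
import ipaddress
from typing import List, Tuple

# Hypothetical lookup table: (network, region). Illustrative values only.
REGION_TABLE: List[Tuple[ipaddress.IPv4Network, str]] = [
    (ipaddress.ip_network("192.0.2.0/24"), "Region A"),
    (ipaddress.ip_network("198.51.100.0/24"), "Region B"),
]


def region_for(ip: str) -> str:
    """Translate a visitor's IP address into a region, or 'unknown'."""
    address = ipaddress.ip_address(ip)
    for network, region in REGION_TABLE:
        if address in network:
            return region
    return "unknown"
```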
6.3 A classification of empirical studies employing state-of-the-art fraud detection technologies
All quoted papers are given in chronological order and grouped with respect to the type of technique(s)
employed. Studies that present comparative results from the application of several anti-fraud
technologies typically appear in multiple entries of the table. See also Fawcett et al. (1998), Bolton and
Hand (2001), Hodge and Austin (2004), Kou et al. (2004), Phua et al. (2005), Delamaire et al. (2009),
Sudjianto et al. (2010), Ngai et al. (2011) and Behdad et al. (2012) for recent reviews of research papers
dealing with automatic fraud detection. Bolton and Hand (2002) is a good guide to the statistical
literature, while Fawcett et al. (1998) and Behdad et al. (2012) focus more on modern artificial
intelligence or nature-inspired paradigms.
Table 13: A Classification of empirical studies employing state-of-the-art fraud detection technologies
Method
Studies
Expert systems
Leonard (1995)
Stefano and Gisella (2001)
Pathak et al. (2005)
Statistical techniques (regression models, discriminant analysis, etc.)
Shen et al. (2007)
Whitrow et al. (2009)
Brabazon et al. (2010)
Lee et al. (2010)
Bhattacharyya et al. (2011)
Jha et al. (2012)
Louzada and Ara (2012)
Network-type classifiers
Ghosh and Reily (1994)
Hanagandi et al. (1996)
Aleskerov et al. (1997)
Dorronsoro et al. (1997)
Brause et al. (1999)
Kim and Kim (2002)
Maes et al. (2002)
Chen et al. (2005)
Shen et al. (2007)
Xu et al. (2007)
Gadi et al. (2008)
Robinson et al. (2011)
Louzada and Ara (2012)
Sahin et al. (2013)
Support vector machines
Chen et al. (2004, 2005, 2006)
Xu et al. (2007)
Whitrow et al. (2009)
Bhattacharyya et al. (2011)
Sahin and Duman (2011)
Hejazi and Singh (2013)
Sahin et al. (2013)
Bayesian learners
Stolfo et al. (1997)
Prodromidis and Stolfo (1999)
Prodromidis et al. (2000)
Maes et al. (2002)
Xu et al. (2007)
Gadi et al. (2008)
Panigrahi et al. (2009)
Whitrow et al. (2009)
Louzada and Ara (2012)
Decision-tree induction techniques
Stolfo et al. (1997)
Prodromidis and Stolfo (1999)
Prodromidis et al. (2000)
Shen et al. (2007)
Xu et al. (2007)
Gadi et al. (2008)
Whitrow et al. (2009)
Bhattacharyya et al. (2011)
Sahin and Duman (2011)
Sahin et al. (2013)
Rule-induction techniques
Stolfo et al. (1997)
Prodromidis and Stolfo (1999)
Prodromidis et al. (2000)
Fan et al. (2001)
Xu et al. (2007)
Robinson et al. (2011)
Anomaly detectors/unsupervised learning techniques
Fan et al. (2001)
Bolton and Hand (2001)
Chen et al. (2006)
Zaslavsky and Strizhak (2006)
Ferdousi and Maeda (2007)
Xu et al. (2007)
Juszczak et al. (2008)
Quah and Sriganesh (2008)
Weston et al. (2008)
Kundu et al. (2009)
Lee et al. (2013)
Hejazi and Singh (2013)
Nature-inspired techniques
Bentley et al. (2000)
Kim et al. (2003)
Wightman (2003)
Tuo et al. (2004)
Chen et al. (2006)
Gadi et al. (2008)
Brabazon et al. (2010)
Ozcelik et al. (2010)
Duman and Ozcelik (2011)
Wong et al. (2011)
Hybrid architectures
Stolfo et al. (1997)
Chan et al. (1999)
Prodromidis and Stolfo (1999)
Prodromidis et al. (2000)
Stolfo et al. (2000)
Wheeler and Aitken (2000)
Syeda et al. (2002)
Park (2005)
Chen et al. (2006)
Gadi et al. (2008)
Kundu et al. (2009)
Panigrahi et al. (2009)
Krivko (2010)
Duman and Ozcelik (2011)
Robinson et al. (2011)
Ryman-Tubb and Krause (2011)
Lei and Ghorbani (2012)
7 References
1.
Abbass, H. A., Bacardit, J., Butz, M. V., Llorà, X. (2004), “Online adaptation in
learning classifier systems: stream data mining”, Technical Report 200403, Illinois
Genetic Algorithms Lab (IlliGAL).
2.
Adams, N. (2009), “Credit card transaction fraud detection”
3.
Agyemang, M., Barker, K., Alhajj, R. (2006), “A comprehensive survey of
numeric and symbolic outlier mining techniques”, Intelligent Data Analysis 10 (6),
pp. 521–538.
4.
Aleskerov, E., B. Freisleben and B. Rao (1997) “CARDWATCH: A Neural
Network Based Database Mining System for Credit Card Fraud Detection,” in
Proceedings of the IEEE/IAFE: Computational Intelligence for Financial Eng., pp.
220-226.
5.
Alexopoulos, P., Kafentzis, K., Benetou, X., Tagaris, T., and Georgolios, P.
(2007), "Towards a Generic Fraud Ontology in e-Government". ICE-B, page 269-276.
INSTICC Press.
6.
Ansari, S., Kohavi, R., Mason, L., and Zheng, Z. (2001), “Integrating E-Commerce and Data Mining: Architecture and Challenges”. In Proceedings of the
2001 IEEE International Conference on Data Mining (ICDM '01), Nick Cercone, Tsau
Young Lin, and Xindong Wu (Eds.). IEEE Computer Society, Washington, DC, USA,
27-34.
7.
Astudillo, C., Bardeen M., and Cerpa N. (2014), “Data Mining in Electronic
Commerce‐Support vs. Confidence”. Journal of Theoretical and Applied Electronic
Commerce Research 9:1, editorial.
8.
Axelsson, S. (2000), “The Base-Rate Fallacy and the Difficulty of Intrusion
Detection”, ACM Transactions on Information and System Security 3(3), pp. 186–
205.
9.
Ayada, W. M., & Elmelegy, N. A. (2014). "Advergames on Facebook a new
approach to improve the Fashion Marketing". International Design Journal, 2(2),
139–151. Retrieved from http://www.journal.faa-design.com/pdf/2-2-ayada.pdf
10.
Bauer, C., Wittmann, G., Stahl, E., Weisheit, S., Pur, S., and Weinfurtner S.
(2011), “So steigern Online-Händler ihren Umsatz. Fakten aus dem deutschen
Online-Handel; Aktuelle Ergebnisse zu Online-Marketing und Web-Controlling
aus dem Projekt E-Commerce-Leitfaden”. ibi-research an der Univ. Regensburg,
Regensburg.
11.
Behdad, M., Barone, L., Bennamoun, M., French, T., (2012), “Nature-Inspired
Techniques in the Context of Fraud Detection," IEEE Transactions on Systems, Man,
and Cybernetics, Part C: Applications and Reviews 42 (6), pp.1273-1290.
12.
Bentley P., Kim, J., Jung. G. & J Choi. (2000), “Fuzzy Darwinian Detection of
Credit Card Fraud”, in Proceedings of the 14th Annual Fall Symposium of the Korean
Information Processing Society, pp. 1-4.
13.
Berry, M.J., and Linoff, G. (2011), “Data Mining Techniques: For Marketing,
Sales, and Customer Support”. 3rd Edition. John Wiley & Sons, Inc., New York, NY,
USA. 2011. ISBN: 978-0-471-47064-9
14.
Bhattacharyya, S., Jha, S., Tharakunnel, K., Westland, J. Ch. (2011), “Data
mining for credit card fraud: A comparative study”, Decision Support Systems (50)
3, pp. 602-613.
15.
Bishop, J. (2007). “Increasing participation in online communities: A
framework for human-computer interaction”. Computers in Human Behavior
(Elsevier Science Publishers) 23 (4): 1881–1893.
16.
Bolton, R. J., Hand, D. J. (2001), “Unsupervised profiling methods for fraud
detection”, in Proceeding of Credit Scoring and Credit Control VII , pp. 5-7.
17.
Bolton, R. J., Hand, D. J. (2002), “Statistical fraud detection: a review”,
Statistical Science 17 (3), pp 235–255.
18.
Brabazon, A., Cahill, J., Keenan, P., Walsh, D. (2010), “Identifying online credit
card fraud using Artificial Immune Systems”, in Proceedings of the 2010 IEEE
Congress on Evolutionary Computation (CEC 2010), pp. 1-7.
19.
Brause, R., Langsdorf, T. and Hepp, M. (1999) “Neural data mining for credit
card fraud detection”, In Proceedings of the 11th IEEE International Conference on
Tools with Artificial Intelligence.
20.
Bundesverband Digitale Wirtschaft (BVDW), e.V., (2012) “Overall, what
percentage of your shopping would you say you do online?”
http://de.statista.com/statistik/daten/studie/248424/umfrage/Anteil-der-OnlineKäufe-an-den-Gesamtkäufen-(nach-Altersgruppen)/.
21.
bvh (2013a), “Interaktiver Handel in Deutschland”.
http://www.bvh.info/uploads/media/140218_Pressepr%C3%A4sentation_bvh-B2CStudie_2013.pdf.
22.
bvh (2013b), “Umsatzstarke Warengruppen im Online-Handel”.
http://de.statista.com/statistik/daten/studie/253188/umfrage/UmsatzstarkeWarengruppen-im-Online-Handel-in-Deutschland/.
23.
Buchholtz, S., Bukowski, M., Śniegocki, A. (2012), “Big and open data in
Europe. A growth engine or a missed opportunity?” A report commissioned by
demos EUROPA – Centre for European Strategy Foundation within the “Innovation
and entrepreneurship” programme. ISBN: 978-83-925542-1-9
24.
Burge, P. , Shawe-Taylor, J. (1997), “Detecting cellular fraud using adaptive
prototypes”, In Proceedings of the AAAI Workshop on AI Approaches to Fraud
Detection and Risk Management, pp. 9-13.
25.
Çakir, A., Çalış, H., & Küçüksille, E. U. (2009). "Data mining approach for
supply unbalance detection in induction motor". Expert Systems with Applications,
36(9), 11808–11813.
26.
Carmona, C., Ramírez-Gallego, S., Torres, F., Bernal, E., Del Jesus, M., and
García S. (2012), “Web usage mining to improve the design of an e-commerce
website: OrOliveSur.com”. Expert Systems with Applications 39(12): 11243–11249.
27.
Chan, P., Stolfo, S. (1998), “Toward scalable learning with non-uniform class
and cost distributions: A case study in credit card fraud detection”, in Proceedings
of the Fourth International Conference on Knowledge Discovery and Data Mining,
AAAI Press, Menlo Park, CA, pp. 164-168.
28.
Chan, P., Fan, W., Prodromidis, A., Stolfo, S. (1999), “Distributed data mining
in credit card fraud detection”, IEEE Intelligent Systems 14(6), pp 67-74.
29.
Chan, P., Stolfo, S. (1993), “Meta-learning for multistrategy and parallel
learning”, In Proceedings of the Second Intl. Work. On Multistrategy Learning,pp.
150-165.
30.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., &
Wirth, R. (2000). "CRISP-DM 1.0 Step-by-step data mining guide". The CRISP-DM
consortium
31.
Charlton, G. (2013), “E-Commerce: Where Next?”
32.
Chawla, N. (2010), “Data Mining for Imbalanced Datasets: An Overview”, in
Maimon, O. and Rokach, L. (eds), Data Mining and Knowledge Discovery Handbook,
Springer, pp. 853-867.
33.
Chen, H., Chiang, R. H. L., and Storey, V. C. (2012), “Business Intelligence and
Analytics: From Big Data to Big Impact”. MIS Quarterly 36(4):1165-1188.
34.
Chen, R., Chiu, M., Huang, Y., and Chen, L. (2004), "Detecting credit card
fraud by using questionnaire-responded transaction model based on SVMs". In
Proceedings of IDEAL2004 (pp. 800–806). Exeter, UK.
35.
Chen, R., Luo, S.-T., Liang, X. and Lee, V. C. S. (2005) “Personalized approach
based on SVM and ANN for detecting credit card fraud”, In Proceedings of the IEEE
International Conference on Neural Networks and Brain, Beijing, China
36.
Chen, R.C., Chen, T.S., Lin, C.C. (2006), “A new binary support vector system
for increasing detection rate of credit card fraud”, International Journal of Pattern
Recognition and Artificial Intelligence 20 (2), pp. 227-239
37.
Chiu, Ch-Ch., Tsai, Ch-Y. (2004), “A Web Services-Based Collaborative Scheme
for Credit Card Fraud Detection”, in Proceedings of the 2004 IEEE International
Conference on e-Technology, e-Commerce and e- Service, pp.177-181.
38.
Cooley, R., Mobasher, B., & Srivastava, J. (1997). "Web mining: Information
and pattern discovery on the world wide web". In Tools with Artificial Intelligence,
1997. In Proceedings of Ninth IEEE International Conference. pp. 558–567.
39.
Concolato, C., Schmitz, P., (Eds.) (2012), "ACM Symposium on Document
Engineering", DocEng '12, Paris, France, September 4-7, ACM 2012.
40.
Cortes, C, Pregibon, D., Volinsky, Ch. (2001), “Communities of interest”,
Advances in Intelligent Data Analysis, Lecture Notes in Computer Science 2189, pp.
105-114.
41.
Damiani E., Uden, L, & Wangsa, T. (2007), "The future of E-learning: E-learning ecosystem." Inaugural Digital EcoSystems and Technologies Conference,
IEEE DEST'07
42.
Delamaire, L., H. Abdou and J. Pointon (2009), “Credit card fraud and
detection techniques: a review”, Banks and Bank Systems 4 (2), pp. 57-68.
43.
Dhanabhakyam, M. and Punithavalli, M. (2011), “A Survey on Data Mining
Algorithm for Market Basket Analysis”. Global Journal of Computer Science and
Technology. 11(11)
44.
Dorronsoro, J.R. , Ginel, F. Sanchez, C. Cruz, C.S. (1997), “Neural fraud
detection in credit card operations”, IEEE Transactions on Neural Networks 8 (4),
pp. 827–834.
45.
Dukino, C. and H. Kett, H., (2014), "Marktstudie: Untersuchung von
Webbasierten Ökosystemen und ihrer Relevanz für kleine und mittlere
Unternehmen", Stuttgart.
46.
Duman, E., Ozcelik, M. (2011), “Detecting credit card fraud by genetic
algorithm and scatter search”, Expert Systems with Applications (38), 10, pp. 13057-13063.
47.
Dziczkowski, G.; Wegrzyn-Wolska, K.; Bougueroua, L. (2013), “An opinion
mining approach for web user identification and clients' behaviour analysis”.
Computational Aspects of Social Networks (CASoN), Fifth International Conference
pp. 79-84, 12-14
48.
EHI Retail Institute, Statista, (2013), Umsatzanteil der Top-Online-Shops.
http://de.statista.com/statistik/daten/studie/203792/umfrage/Umsatzanteil-der-größten-Online-Shops-in-Deutschland/.
49.
Elizabeth, V., (2014) “The Best of the Best in Ecommerce Trends for 2014”,
METRIA.
50.
Elkan, Ch. (2001), “The foundations of cost-sensitive learning”, in Proceedings
of the Seventeenth International Joint Conference on Artificial Intelligence
(IJCAI’01), pp. 973-978.
51.
eMarketer, (2013b), “E-Commerce Umsatz weltweit”.
http://de.statista.com/statistik/daten/studie/187663/umfrage/E-Commerce-Umsatz-weltweit-nach-Regionen/.
52.
eMarketer, (2013c), “Entwicklung des B2C-E-Commerce-Umsatzes in
Europa”. http://de.statista.com/statistik/daten/studie/2813/umfrage/Entwicklung-des-B2C-E-Commerce-Umsatzes-in-Europa/.
53.
Fan, W. (2004), “Systematic Data Selection to Mine Concept-Drifting Data
Streams”, in Proceedings of SIGKDD04, pp. 128-137.
54.
Fan, W., Miller, M., Stolfo, S., Lee, W., Chan, P. (2001), “Using Artificial
Anomalies to Detect Unknown and Known Network Intrusions”, in Proceedings of
the ICDM01, pp. 123-248.
55.
Fang, L., Cai, M., Fu, H., and Dong, J. (2007), "Ontology-Based Fraud
Detection". Computational Science – ICCS 2007. Lecture Notes in Computer Science
4489: 1048-1055.
56.
Fawcett, T., Haimowitz, I., Provost, F., Stolfo, S. (1998), “AI Approaches to
Fraud Detection and Risk Management”, AI Magazine 19 (2), pp. 107-108.
57.
Ferdousi, Z., Maeda, A. (2007), “Anomaly Detection Using Unsupervised
Profiling Method in Time Series Data”, in Proceedings of the 10th East-European
Conference on Advances in Databases and Information Systems (ADBIS-2006),
available from http://ceur-ws.org/Vol-215.
58.
Gadi, M., Wang, X., Pereira do Lago, A. (2008), “Credit card fraud detection
with artificial immune system”, in Bentley, P. J., Lee, D., Jung, S. (eds), Artificial
Immune Systems, Lecture Notes in Computer Science 5132, Springer Berlin
Heidelberg, pp. 119-131.
59.
Ge, Y., Xiong, H., Tuzhilin, A., and Liu, Q. (2014), “Cost-Aware Collaborative
Filtering for Travel Tour Recommendations”. ACM Trans. Inf. Syst. 32, 4, 31 pages.
60.
Ghosh, S., Reilly, D.L. (1994), “Credit Card Fraud Detection with a Neural-Network”,
in Proceedings of the 27th Hawaii International Conference on System Sciences 3,
pp. 621-630.
61.
Gomez-Perez A., Oscar C., and Fernandez-Lopez, M. (2004), "Ontological
Engineering". Springer-Verlag London Limited.
62.
Goodwin P. (2002), “Integrating management judgment and statistical
methods to improve short-term forecasts”, Omega 30 (2), pp. 127-135.
63.
Grasso, G., Furche, T., and Schallhart, C., (2013), “Effective web scraping with
OXPath”. In: Proceedings of the 22nd international conference on World Wide Web
companion, pp. 23–26.
64.
Gruber T.R. (1993), "A translation approach to portable ontology
specification". Knowledge Acquisition 5(2):199-220.
65.
Han, J., Kamber, M., & Pei, J. (2006). "Data mining: concepts and techniques".
Morgan Kaufmann.
66.
Hanagandi, V., Dhar, A. and Buescher, K. (1996), “Density-Based Clustering
and Radial Basis Function Modeling to Generate Credit Card Fraud Scores”, In
Proceedings of the IEEE/IAFE 1996 Conference.
67.
Hand, D. (2006), “Classifier technology and the illusion of progress”,
Statistical Science 21 (1), pp. 1–15.
68.
Hand, D. (2007), “Statistical techniques for fraud detection, prevention and
evaluation”, invited lecture in the NATO Advanced Study Institute’s Workshop in
Mining Massive Data Sets for Security (MMDSS07), September 10 - 21, 2007, Villa
Cagnola - Gazzada – Italy
69.
Hand D. (2009), "A (personal) view of statistical issues in (mainly retail) credit
risk assessment". OCC - NISS meeting 5-6 Feb 09
70.
Hand, D., Whitrow, C., Adams, N., Juszczak, P., Weston, D. (2008),
“Performance criteria for plastic card fraud detection tools”, Journal of the
Operational Research Society 59, pp. 956 -962.
71.
Hassler, M., (2012), “Web Analytics: Metriken auswerten,
Besucherverhalten verstehen, Website optimieren”, 3rd edn. Mitp, Heidelberg [u.a.].
72.
Hejazi, M., Singh, Y. P. (2013), “One-Class Support Vector Machines Approach
To Anomaly Detection”, Applied Artificial Intelligence: An International Journal
27(5), pp. 351–366.
73.
Hesse, J., (2013), “Seven e-commerce trends to look out for in 2014”.
74.
Hodge, V., Austin, J. (2004), “A Survey of Outlier Detection Methodologies”,
Artificial Intelligence Review 22 (2), pp. 85–126.
75.
Hu, B., Carvalho, N., Laera, L., Lee, V., Matsutsuka, T., Menday, R., Naseer, A.
(2012), "Applying Semantic Technologies to Public Sector: A Case Study in Fraud
Detection". JIST 2012: 319-325
76.
Hunt, J., Cooke, D. (1996), “Learning using an artificial immune system”,
Journal of Network and Computer Applications 19 (2), pp. 189-212.
77.
Institut für Demoskopie Allensbach, (2013), “Anteil der Online-Käufer in
Deutschland bis 2013”.
http://de.statista.com/statistik/daten/studie/2054/umfrage/Anteil-der-Online-Käufer-in-Deutschland/
78.
Jha, S., Guillen, M., Westland, J. Ch. (2012), “Employing transaction
aggregation strategy to detect credit card fraud”, Expert Systems with Applications
(39) 16, pp. 12650-12657.
79.
Juszczak, P., Adams, N. M., Hand, D. J., Whitrow, C., and Weston, D. J. (2008),
"Off-the-peg and bespoke classifiers for fraud detection". Computational Statistics &
Data Analysis 52(9).
80.
Kandel, S., Paepcke, A., Hellerstein, J. M., and Heer, J. (2012), “Enterprise
Data Analysis and Visualization: An Interview Study”. IEEE Trans. Visual. Comput.
Graphics 18, 2917–2926.
81.
Kandula, S. and Communication, ACM Special Interest Group on Data. (2012),
Proceedings of the 11th ACM Workshop on Hot Topics in Networks. ACM, [S.l.].
82.
Kawabe, T., Yamamoto, Y., Tsuruta, S., Damiani, E., Yoshitaka, A., and
Mizuno, Y. (2013), “Digital eco-system for online shopping”. In Proceedings of the
Fifth International Conference on Management of Emergent Digital EcoSystems
(MEDES '13). ACM, New York, NY, USA, 33-39.
83.
Kim, J., Bentley, P., Aickelin, U., Greensmith, J., Tedesco, G., Twycross, J.
(2007), “Immune system approaches to intrusion detection – a review”, Natural
Computing 6 (4), pp. 413-466.
84.
Kim, J., Ong, A., Overill, R. (2003), “Design of an Artificial Immune System as a
Novel Anomaly Detector for Combating Financial Fraud in the Retail Sector”, in
Proceedings of the 2003 Congress on Evolutionary Computation (CEC '03), vol.1,
pp.405-412.
85.
Kim, M., Kim, T. (2002), “A Neural Classifier with Fraud Density Map for
Effective Credit Card Fraud Detection”, in Proceedings of the 3rd International
Conference on Intelligent Data Engineering and Automated Learning, Springer-Verlag, pp. 378-383.
86.
Kingston, J., Schafer, B., Vandenberghe, W. (2003), "No Model Behaviour
Ontologies for Fraud Detection". Law and the Semantic Web, 3369, pp. 233-247.
87.
Kohavi, R. (2001), “Mining e-commerce data: the good, the bad, and the
ugly”. In Proceedings of the seventh ACM SIGKDD international conference on
Knowledge discovery and data mining (KDD '01). ACM, New York, NY, USA, 8-13.
DOI=10.1145/502512.502518 http://doi.acm.org/10.1145/502512.502518
88.
Kohavi, R., Mason, L., Parekh, R., Zheng, Z. (2004), “Lessons and Challenges
from Mining Retail E-Commerce Data”. Machine Learning. Springer, 57(1-2): 83-113.
89.
Kotsiantis, S. Kanellopoulos, D. Pintelas, P. (2006), “Handling imbalanced
datasets: a review”, GESTS International Transaction in Computer Science and
Engineering 30 (1), pp. 25–36.
90.
Kou, Y., Lu, C.T., Sirwongwattana, S., Huang, Y.P. (2004), “Survey of fraud
detection techniques”, in Proceedings of the IEEE International Conference on
Networking, Sensing and Control, March 21-23 2004, Taiwan, pp. 749–754.
91.
Krivko, M. (2010), “A hybrid model for plastic card fraud detection systems”,
Expert Systems with Applications 37 (8), pp. 6070–6076.
92.
Kundu, A., Panigrahi, S., Sural, S., Majumdar, A.K., (2009), “BLAST-SSAHA
Hybridization for Credit Card Fraud Detection”, IEEE Transactions on Dependable
and Secure Computing 6(4), pp. 309-315.
93.
Kumar, L., Singh, H., and Kaur, R. (2012), “Web Analytics and Metrics: A
Survey”. In: Proceedings of the International Conference on Advances in
Computing, Communications and Informatics. New York, NY, USA, ACM, pp. 966–971.
94.
lebensmittelzeitung.net, (2013), Besucherzahlen von Online-Shops.
http://de.statista.com/statistik/daten/studie/158229/umfrage/Online-Shops-in-Deutschland-nach-Besucherzahlen/.
95.
Lee, B., Cho, H., Chae, M., Shim, S., (2010), “Empirical analysis of online
auction fraud: credit card phantom transactions”, Expert Systems with Applications
37 (4), pp. 2991–2999.
96.
Lee, R.S.T., Liu, J.N.K. (2004), “iJADE Web-miner: an intelligent agent
framework for Internet shopping”. IEEE Transactions on Knowledge and Data
Engineering, 16(4): 461 - 473. 2004. DOI: 10.1109/TKDE.2004.1269670.
97.
Lee, Y.-J., Yeh, Y.-R.; Wang, Y.-Ch. F., (2013), “Anomaly Detection via Online
Oversampling Principal Component Analysis”, IEEE Transactions on Knowledge and
Data Engineering 25 (7), pp.1460-1470.
98.
Lei, J., Ghorbani, A. (2012), “Improved competitive learning neural networks
for network intrusion and fraud detection”, Neurocomputing (75) 1, pp. 135-145.
99.
Leonard K. (1995), “The development of a rule based expert system model
for fraud alert in consumer credit”, European Journal of Operational Research 80
(2), pp. 350-356.
100.
Linoff, G., and Berry, M. J. (2001), “Mining the Web: Transforming Customer
Data into Customer Value”. John Wiley & Sons, 2001. ISBN: 978-0-471-41609-8
101.
Lim, E.-P., Chen H., and Chen, G. (2013), “Business intelligence and analytics:
Research directions”. ACM Transactions on Management Information Systems
(TMIS) 3(4): 17.
102.
Liu, B. and Chen-Chuan-Chang, K. (2004), “Editorial: special issue on web
content mining”. SIGKDD Explor. Newsl. 6, 1–4.
103.
Liu, P., Li, L. (2002), “A game-theoretic approach for attack prediction”,
Technical report, PSU-S2-2002-01, Penn State University.
104.
Louzada, F., Ara, A., (2012), “Bagging k-dependence probabilistic networks:
An alternative powerful fraud detection tool”, Expert Systems with Applications 39
(14), pp. 11583-11592.
105.
MacVittie, L. (2002), “Online fraud detection takes diligence”, Network
Computing, 13 (4), pp. 80-83.
106.
Maes, S., Tuyls, K., Vanschoenwinkel, B., Manderick, B. (2002), “Credit Card
Fraud Detection using Bayesian and Neural Networks”, in Proceedings of the 1st
International NAISO Congress on Neuro Fuzzy Technologies (NF2002).
107.
Markov, Z. and Larose, D. T. (2007), “Data mining the Web: uncovering
patterns in Web content, structure, and usage”. John Wiley & Sons.
108.
Mikut, R., & Reischl, M. (2011). "Data mining tools". Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery, 1(5), 431–443.
109.
Mussman, D. C., Adornato, R. L., Barker, T. B., Katz, R. A., & West, G. L.
(2014). "Methods and system for providing real time offers to a user based on
obsolescence of possessed items". Google Patents.
110.
Ngai, E.W.T., Hu, Y., Wong, Y.H., Chen, Y., Sun, X. (2011), “The application of
data mining techniques in financial fraud detection: A classification framework and
an academic review of literature”, Decision Support Systems 50, pp. 559-569.
111.
Niwa, S., Takuo D. and Honiden S. (2006), “Web Page Recommender System
based on Folksonomy Mining for ITNG ’06 Submissions”. In: Third International
Conference on Information Technology: New Generations (ITNG'06), pp. 388–393.
112.
Obweger, H., Schiefer, J., Suntinger, M., Kepplinger, P., and Rozsnyai, S.
(2011), “User-oriented rule management for event-based applications”. In:
Proceedings of the 5th ACM international conf. on Distributed event-based system,
pp. 39–48.
113.
Oseman, K., Shukor, S., Haris, N., Bakar, F. (2010), “Data Mining in Churn
Analysis Model for Telecommunication Industry”. Journal of Statistical Modelling
and Analytics. Vol. 1 No. 19-27.
114.
Ozcelik, M., Isik, M., Duman, E., Cevik, T. (2010), “Improving a credit card
fraud detection system using genetic algorithm” in Proceedings of the 2010
International Conference on Networking and Information Technology, pp. 436-440.
115.
Ozen, H., & Engizek, N. (2014). "Shopping online without thinking: being
emotional or rational?". Asia Pacific Journal of Marketing and Logistics, 26(1), 78–93.
116.
Palpanas, T. (2012), “A knowledge mining framework for business analysts”.
SIGMIS Database 43, 1 (February 2012), 46-60.
117.
Panigrahi, S., Kundu, A., Sural, Sh., Majumdar, A.K., (2009), “Credit card fraud
detection: A fusion approach using Dempster–Shafer theory and Bayesian
learning”, Information Fusion 10 (4), pp. 354-363.
118.
Park, L.J. (2005), “Learning of Neural Networks for Fraud Detection Based on
a Partial Area Under Curve”, in Wang, J. and Liao, X.-F., and Yi, Z. (eds), Advances in
Neural Networks, Lecture Notes in Computer Science (3497), pp. 922-927.
119.
Patel, K. B., Chauhan, J. A., Patel, J. D. (2011), “Web Mining in E-Commerce:
Pattern Discovery, Issues and Applications”. International Journal of P2P Network
Trends and Technology- 1(3):40-45. ISSN: 2249-2615-1.
120.
Pathak, J., Vidyarthi, N., Summers, S. (2005) "A fuzzy-based algorithm for
auditors to detect elements of fraud in settled insurance claims", Managerial
Auditing Journal (20) 6, pp. 632 – 644.
121.
Pavía, J.M., Veres-Ferrer, E.J., Foix-Escura, G. (2012), “Credit card incidents
and control systems”, International Journal of Information Management (32) 6, pp.
501-503.
122.
Perner, P., & Fiss, G. (2002). "Intelligent E-marketing with web mining,
personalization, and user-adapted interfaces". In Advances in Data Mining. pp. 37–52.
Springer.
123.
Peters, M., (2013) “Was werden die E-Commerce Trends in 2014?”
124.
Phua, C., Lee, V., Smith-Miles, K., Gayler, R. (2005), “A comprehensive survey
of data mining based fraud detection research”, Working paper available from
http://arxiv.org/abs/1009.6119 (Feb 06, 2014).
125.
Pitman, A., Zanker, M., Fuchs, M., & Lexhagen, M. (2010). "Web usage mining
in tourism a query term analysis and clustering approach". In U. Gretzel, R. Law, &
M. Fuchs (Eds.), Information and Communication Technologies in Tourism 2010.
Proceedings of the International Conference in Lugano, Switzerland, pp. 393–403.
126.
Plaza, B. (2011). "Google Analytics for measuring website performance".
Tourism Management, 32(3), 477–481.
127.
Plessas-Leonidis, S., Leopoulos, V., & Kirytopoulos, K. (2010). "Revealing sales
trends through data mining". In Computer and Automation Engineering (ICCAE),
2010. The 2nd International Conference on Vol. 1, pp. 682–687.
128.
Prodromidis, A., Chan, P. K., Stolfo, S. (2000) “Meta-learning in distributed
data mining systems: issues and approaches”, in H. Kargupta and P. Chan (eds.),
Advances of distributed data mining, AAAI Press, ch 3.
129.
Prodromidis, A., Stolfo, S. (1999), “Agent-Based Distributed Learning Applied
to Fraud Detection”, in Proceedings of the Sixteenth National Conference on
Artificial Intelligence, pp. 014-99.
130.
Quah T. S. and Sriganesh M. (2008), “Real-time credit card fraud using
computational intelligence”, Expert Systems with Application, 35(4), pp. 1721-1732.
131.
Rahi, P., and Thakur, J. (2012), “Business Intelligence: A Rapidly Growing
Option through Web Mining”. IOSR Journal of Computer Engineering (IOSRJCE).
ISSN: 2278-0661, ISBN: 2278-8727 Volume 6, Issue 1, pp. 22-29.
132.
Rajaraman, A., Leskovec, J., and Ullman, J. D. (2013), "Mining of Massive
Datasets".
133.
Rajput, Q., Sadaf Khan, N., Larik, A., Haider, S. (2014), "Ontology Based
Expert-System for Suspicious Transactions Detection". Computer and Information
Science, 7, No. 1
134.
Ramaki, A. A., Asgari, R., Atani, R. E. (2012) "Credit Card Fraud Detection
Based on Ontology Graph", International Journal of Security, Privacy and Trust
Management (IJSPTM), Vol. 1, No 5, October 2012, Pages: 1-12.
135.
Rao, T.K.R.K.; Khan, S.A.; Begum, Z.; Divakar, C. (2013), "Mining the E-commerce cloud: A survey on emerging relationship between web mining, E-commerce and cloud computing," Computational Intelligence and Computing
Research (ICCIC), 2013 IEEE International Conference on, pp. 1-4, 26-28
136.
Robinson, N., Graux, H., Parrilli, D., Klautzer, L., Lorenzo, V. (2011),
“Comparative study on legislative and non-legislative measures to combat identity
theft and identity related crime”, Technical Report TR-982-EC, RAND Europe and
Time-lex.
137.
Rönisch, S., (2013) Zukunft E-Commerce: Zwölf Trends für 2014.
138.
Russom, P. (2013), “Integrating Hadoop into Business Intelligence and Data
Warehousing”. Second Quarter 2013. Tdwi Best Practices Report.
http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper2/integrating-hadoop-business-intelligence-datawarehousing-106436.pdf
139.
Ryman-Tubb, N. and Krause, P. (2011), “Neural Network Rule Extraction to
Detect Credit Card Fraud”, in Iliadis, L. and Jayne, Ch. (eds), Engineering
Applications of Neural Networks, IFIP Advances in Information and Communication
Technology 363, Springer Berlin Heidelberg, pp. 101-110.
140.
Sadegh, M., Ibrahim, R., Othman, Z. (2012), “Opinion Mining And Sentiment
Analysis: A Survey”, International Journal of Computers & Technology 2(3):171-178
141.
Sahin, Y., Bulkan, S., Duman, E. (2013), “A cost-sensitive decision tree
approach for fraud detection”, Expert Systems with Applications 40 (15), pp. 5916-5923.
142.
Sahin, Y., Duman, E. (2011) “Detecting Credit Card Fraud by Decision Trees
and Support Vector Machines”, in Proceedings of the International
MultiConference of Engineers and Computer Scientists 2011 Vol I, IMECS 2011, 16-18, 2011, Hong Kong.
143.
Shao, J., & Gretzel, U. (2010). "Looking does not automatically lead to
booking: analysis of clickstreams on a Chinese travel agency website". In U. Gretzel,
R. Law, & M. Fuchs (Eds.), Information and Communication Technologies in Tourism
2010. Proceedings of the International Conference. pp. 197–208 Lugano.
144.
Skhiri, S., and Jouili, S. (2012), "Large Graph Mining: Recent Developments,
Challenges and Potential Solutions". eBISS 2012: 103-124.
145.
Shen, A., Tong, R., Deng, Y. (2007). "Application of classification models on
credit card fraud detection". In International Conference on Service Systems and
Service Management, Chengdu, China, June 2007.
146.
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. N. (2000). "Web usage
mining: Discovery and applications of usage patterns from web data". ACM SIGKDD
Explorations Newsletter, 1(2), 12–23.
147.
Srivastava, T., Desikan, P., & Kumar, V. (2005). "Web mining--concepts,
applications and research directions". In Foundations and Advances in Data Mining.
Vol. 180., pp. 275–307. Springer Berlin Heidelberg.
148.
Stefano, B., Gisella, F. (2001), “Insurance fraud evaluation: a fuzzy expert
system”, in the Proceedings of the 10th IEEE International Conference on Fuzzy
Systems 3, pp.1491-1494.
149.
Stolfo, S., Fan, D.W., Lee, W., Prodromidis, A., Chan, P. (2000), “Cost-Based
Modeling for Fraud and Intrusion Detection: Results from the JAM Project”, in
Proceedings of the DARPA Information Survivability Conference and Exposition, vol.
2, pp. 130-144.
150.
Stolfo, S. J., Fan, D. W., Lee, W., Prodromidis, A., Chan, P. (1997), “Credit card
fraud detection using meta-learning: issues and initial results”, in Proceedings of
the AAAI Workshop on AI Approaches to Fraud Detection and Risk Management,
pp. 83-90.
151.
Stolpmann, M. (2001). "Online-Marketingmix". Galileo Press.
152.
Strauss, J., Frost, R., & Ansary, A. I. (2009). "E-marketing". Pearson Prentice
Hall.
153.
Sudjianto, A., Nair, S., Yuan, M., Zhang, A.J., Kern, D., Cela-Diaz, F. (2010),
“Statistical Methods for Fighting Financial Crimes”, Technometrics 52 (1), pp. 5-19.
154.
Syeda, M., Zhang, Y. and Pan, Y. (2002), “Parallel granular neural networks for
fast credit card fraud detection”, in Proceedings of the 2002 IEEE International
Conference on Fuzzy Systems.
155.
Tadepalli S., Sinha, A. K., and Ramakrishnan, N. (2004), "Ontology driven
data mining for geoscience". Proceedings of 2004 AAG Annual Meeting, Denver,
USA.
156.
Ting, I-H., Wu, H-J. (2009), “Web Mining Applications in E-Commerce and E-Services”. Studies in Computational Intelligence, Vol. 172.
157.
Tuo, J., Ren, S., Liu, W., Li, X., Li, B., Lei, L. (2004), “Artificial Immune System
for Fraud Detection”, in Proceedings of the 2004 IEEE International Conference on
Systems, Man and Cybernetics, pp. 1407 – 1411.
158.
Vatsa, V., Sural, Sh., Majumdar, A.K. (2005), “A Game-Theoretic Approach to
Credit Card Fraud Detection”, in Jajodia, S. and Mazumdar, Ch. (eds), Information
Systems Security, Lecture Notes in Computer Science (3803), Springer Berlin
Heidelberg, pp. 263-276
159.
Vidhate, D. and Kulkarni, P. (2012), “Cooperative Machine Learning with
Information Fusion for Dynamic Decision Making in Diagnostic Applications”. In:
2012 International Conference on Advances in Mobile Network, Communication
and its Applications (MNCAPPS), pp. 70–74.
160.
Web Analytics Association (2008), "Web Analytics Definitions".
161.
Weening, A. (2013), "Europe B2C Ecommerce Report 2013".
162.
Weston, C., Hand, D., Adams, N. M., Whitrow, C., Juszczak, P. (2008), “Plastic
card fraud detection using peer group analysis”, Advances in Data Analysis and
Classification 2 (1), pp. 45-62.
163.
Wheeler, R., Aitken, S., (2000), “Multiple Algorithms for Fraud Detection”,
Knowledge-Based Systems 13, pp. 93-99.
164.
Whitrow, C., Hand, D., Juszczak, P., Weston, D., Adams, N. (2009),
“Transaction aggregation as a strategy for credit card fraud detection”, Data Mining
and Knowledge Discovery 18(1), pp. 30-55.
165.
Wightman J. (2003), “Computer immune techniques in e-commerce fraud
detection”, Thesis submitted to the School of Information Systems and Technology
Management, The University of New South Wales.
166.
Wong, N., Ray, P., Stephens, G., Lewis, L. (2011), “Artificial immune systems
for the detection of credit card fraud: an architecture: prototype and preliminary
results”, Information Systems Journal 22 (1), pp. 53-76.
167.
Woo, J. W. (2012), “Market Basket Analysis Algorithm on Map/Reduce in
AWS EC2”. International Journal of Advanced Science and Technology. Vol. 46: 25-37
168.
Woon, Y.-K., Ng, W.-K., and Lim, E.-P. (2005), Web Usage Mining: Algorithms
and Results. Web Mining: Applications and Techniques, 373.
169.
Xiuhua, L. (2012), “Research on Individual Tourism Service System Based on
Web Mining”. Advances in Intelligent and Soft Computing. V 141, 2012, pp 293-298.
170.
Xu, J., Sung, A., Liu, Q. (2007), “Behavioural data mining for fraud detection”,
Journal of Research and Practice in Information Technology 39 (1), pp. 3-18.
171.
Zaïane, O. R., Xin, M., & Han, J. (1998). "Discovering web access patterns and
trends by applying OLAP and data mining technology on web logs". In Research and
Technology Advances in Digital Libraries, 1998. ADL 98. Proceedings. IEEE
International Forum on. pp. 19–29.
172.
Zaslavsky V. and Strizhak A. (2006), “Credit card fraud detection using self-organizing maps”, Information and Security 18, pp. 48-63.
173.
Zhang, X., He, K., Wang, J., Wang C., and Li, Z. (2013), “On-Demand Business
Rule Management Framework for SaaS Application”. In: Cloud Computing and
Services Science, Springer, pp. 135–150.
174.
Zhao, Y., Sundaresan, N., Shen, Z., and Yu, P. (2013), “Anatomy of a web-scale
resale market: a data mining approach.” In Proceedings of the 22nd international
conference on World Wide Web (WWW '13), Switzerland, 1533-1544.