Establishing Fraud Detection Patterns
Based on Signatures
Pedro Ferreira¹, Ronnie Alves¹, Orlando Belo¹ and Luís Cortesão²
¹ University of Minho, Department of Informatics, Campus of Gualtar,
4710-057 Braga, Portugal
{pedrogabriel,ronnie,obelo}@di.uminho.pt
² Portugal Telecom Inovação, SA, Rua Eng. José Ferreira Pinto Basto,
3810-106 Aveiro, Portugal
[email protected]
Abstract. All over the world we have been witnessing a significant
increase in the usage of telecommunication systems. People are faced day
after day with strong marketing campaigns seeking their attention to
new telecommunication products and services. Telecommunication companies struggle in a highly competitive business arena. It seems that their
efforts have paid off, because customers are strongly adopting the new
trends and systematically use (and abuse) communication services in
their daily lives. Although fraud situations are rare, they are increasing
and they account for a large amount of money that telecommunication
companies lose every year. In this work, we study the problem of fraud
detection in telecommunication systems, especially the cases of superimposed fraud, and provide an anomaly detection technique supported by a
signature schema. Our main goal is to detect deviating behaviors in a
timely manner, giving fraud analysts a better basis for accurate
decisions when establishing potential fraud situations.
1 Introduction
Today communication is a common act of living. Recent telecommunications
market analyses show that companies have been performing very well, especially in
the area of new products and services. Telecommunications companies have been
continuously and significantly improving their business incomes and extending
their influence in the market. However, some studies show that telecommunication companies lose large amounts of money every year due to a large diversity of
fraudulent cases. Because fraud is continuously evolving and telecommunications networks generate huge amounts of data (sometimes of the order
of several gigabytes per day), the detection and identification of fraud cases is
extremely hard and costly, demanding a huge amount of resources (human and
material) to fight it. Essentially, two main types of fraud can be distinguished
[19]: subscription and superimposition fraud. In the former, the fraudsters (faking identifications) create a new account without any intention
of paying for the services used. Typically, these cases reveal intensive high usage
right from the beginning. In the latter, the fraudsters make illegitimate use
of a legitimate account by different means. In this case, some abnormal usage
is blurred into the characteristic usage of the account. This type of fraud is
usually more difficult to detect and poses a bigger challenge to the telecommunications companies. Since the 1990s, some telecommunications companies have used
several kinds of approaches based on statistical analysis and heuristic
methods to assist them in the detection and categorization of fraud situations.
Additionally, some of them have adopted data mining and
knowledge discovery techniques.
Telecommunications scenarios pose big challenges to traditional data mining
techniques. Here we emphasize three of these challenges. 1) The abstraction level of the analysis. Fraud analysts are typically interested in the customer
behavior and not in the call details. For each call, telecommunication systems
generate a record, the call detail record (CDR), that has enough information to
completely describe a call. However, a CDR is not by itself enough to detect a
fraud situation. We are interested in studying the customer behavior and not
individual phone calls. Thus, based on CDRs, we must use some kind of profiling technique in order to reveal, with a certain accuracy, the customer behavior
over time. Signature records that include a large diversity of features, such
as number of calls, average call duration, average number of calls received, etc.,
can be used to establish customer profiles. Additionally, customer data (age, job,
location, price plan and so on), which is of critical importance in this analysis, can
also be used in this profile construction. Therefore, we can summarize three levels of
data [18]: call, behavior and client. 2) Inappropriateness of data for supervised
techniques. Data mining techniques are more suitable to work only on the last
two levels of data and, typically, they can be divided into two categories: supervised and unsupervised learning. In supervised techniques there is feedback
to the system, since the inputs and respective outputs are known. In this case
all the instances in the data have been assigned a predefined class. In unsupervised techniques the system has no hints on how to find the correct answer, since no a priori
discrimination of the data exists. From the fraud detection point of view, where
the goal is to discriminate between normal and fraudulent users, the supervised
techniques seem to be more appropriate to the problem. Nevertheless, for
several reasons, like the absence of previously known fraud cases or the imbalance of the data (fraud occurs in a relatively small number of cases) [18], the
direct application of supervised techniques is not always possible. 3) The need
for real-time or near real-time update of the detection system information, due
to the high costs associated with fraud.
In order to capture the characteristics of a user's behaviour, the concept of
signature can be applied. This concept has already been used successfully for
anomaly detection in many areas like credit card usage [11], network intrusion
[13, 11] and in particular in telecommunications fraud [3, 21, 1, 5]. A signature
corresponds to a set of information that captures the typical behavior of a user:
for example, the average number of calls, the time of the calls, the area where the calls
are made and so on. Thus, if at a given moment a user deviates from the
typical behavior expressed by their signature, that can be a motive to trigger
an alarm for further analysis of that user. In fraud and intrusion detection
systems, signatures can be used in two distinct ways:
– Detection based on User Profiles - The signature of the user is compared
against a database of cases of known illegitimate use. This kind of method
falls under the class of supervised learning techniques.
– Detection based on Signatures - The user signature is used as a comparison
basis. A difference between the actual behaviour of the user
and their signature may reveal an anomalous situation.
In this paper we tackle the problem of superimposed fraud detection in
telecommunication systems. We propose an anomaly detection technique based
on the concept of signature. Our goal is to detect deviating behaviors in a timely
manner, giving analysts a better basis for accurate decisions when
establishing potential fraud situations. In the following sections, we describe
the signature based detection models and algorithms developed, as well as the
current functional architecture of the proposed system.
2 Detecting Fraud Situations Based on Signatures
Our technique has as a core concept the notion of signature. We emphasize the
work of Cortes and Pregibon [5], since it was the main inspiration for the use of
signatures. In this section, we start by presenting our own definition of signature.
Next, we present all its relevant elements and the theoretical background that
allows computing the statistical-based distances of the signatures. Finally, we
explain how the management (start and update) of the signatures is done.
2.1 Definition of Signature
A signature of a user corresponds to a vector of feature variables whose values
are determined during a certain period of time. The variables can be simple,
if they consist of a single atomic value (e.g. an integer or a real), or complex, if
they consist of two co-dependent statistical values, typically the average and the
standard deviation of a given feature.
A signature S is then obtained from a function ϕ for a given temporal window
w, where S = ϕ(w). We consider a time unit to be the amount of time during which the
CDRs are accumulated before being processed at the end of the period. The
value of w is proportional to the time unit, w = α × ∆t. For example, if we
consider a ∆t of one day, we will have α = 7 for a temporal window of one
week.
In figure 1 we illustrate the scheme of the evolution of a signature through
time. S corresponds to the initial value of the user signature. After a shift of one
unit of time, the signature S is then updated to S′, according to the new usage
information (CDRs that happen between the end of S and S′). For a given set of
CDRs (shaded area) observed in a unit of time ∆t, a comparison against the most
recent value of the signature can be made in order to detect deviating behaviors.
Since this information is processed to summarize the user behavior in a certain time
period, we call it a summary. The reason for this name will be
made clearer in the next sections.
Fig. 1. Evolution of a signature through time.
The described type of processing is time oriented, since the set of user actions
is accumulated, kept and processed during the time unit for later analysis.
On the other hand, we can have action oriented processing that makes the
direct comparison of each new action (CDR) against the signature.
A signature can be updated according to one of these two modes. In [5] it is
pointed out that the most adequate mode for the updating is the action oriented one.
This is mainly due to the elevated costs associated with fraud, which require
a constant (for every call) update of the signature. In this work we choose the
time oriented mode for signature updating. The reason for this is the high processing cost of the update operation. As we will see in the following sections, signature
processing requires the analysis of massive volumes of data. Since the time
unit used can be kept reasonably small (typically one day or less), a reasonable trade-off
between processing cost and up-to-date information is achieved.
2.2 Elements of a Signature
Each of the signature feature variables is obtained directly from coded fields
of one or more CDRs. Each feature variable corresponds to a statistical
value which describes a certain aspect of the user behavior. Both a signature and
a summary correspond to the set of all the variables. The main difference resides
in the time window that they summarize. In order to capture the user behavior in
different situations, a signature reflects a longer time window, for example a
week, a month or even a half-year period. On the other hand, for the reasons already
pointed out, a summary reflects a much smaller time period, for example an
hour, half a day or a complete day.
Our proposed model contemplates simple and complex variables: a simple
variable corresponds to an average value and a complex variable to the average
and standard deviation of a certain feature. In table 1 we list the feature variables
and their respective types.
Description                                            Type
Duration of Calls                                      Complex
Number of Calls - Working days                         Complex
Number of Calls - Weekends and Holidays                Complex
Number of Calls - Working Time (8h-20h)                Complex
Number of Calls - Night Time (20h-8h)                  Complex
Number of Calls to the Different national networks*    Simple
Number of Calls as Caller (Origin)                     Simple
Number of Calls as Called (Destination)                Simple
Number of International Calls                          Simple
Number of Calls as Caller in Roaming                   Simple
Number of Calls as Called in Roaming                   Simple
Table 1. Description of the feature variables used in signature and summary and
their respective types. *Currently in Portugal there are three wireless telecommunications
companies and one major company in fixed telecommunications.
The choice of the type of the variables depends on several factors, like the
complexity of the feature described or the data available to perform such calculation. A feature like the duration of the calls shows a significant variability which
is much better expressed through an average/standard-deviation parameter. A
feature like the number of international calls is typically much less frequent and
thus an average value is sufficient to describe it.
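As a minimal sketch of how the two kinds of feature variables could be represented, the Python below builds a summary from CDRs grouped by time unit. The `CDR` fields, the `Complex` type and the `summarize` helper are hypothetical names introduced for illustration; the record layout and the way counts are averaged are assumptions, since they are not specified here.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class CDR:
    """Hypothetical, heavily reduced call detail record."""
    duration_s: float      # call duration in seconds
    international: bool    # True for an international call


@dataclass(frozen=True)
class Complex:
    """Complex feature variable: average and standard deviation."""
    mu: float
    sigma: float


def summarize(cdrs_per_unit):
    """Build a summary from CDRs grouped by time unit (e.g. one list per day).

    Returns one complex variable (call duration) and one simple variable
    (average number of international calls per time unit).
    """
    durations = [c.duration_s for unit in cdrs_per_unit for c in unit]
    intl_counts = [sum(c.international for c in unit) for unit in cdrs_per_unit]
    duration = Complex(mean(durations), pstdev(durations)) if durations else Complex(0.0, 0.0)
    return {
        "duration": duration,
        "international_calls": mean(intl_counts) if intl_counts else 0.0,
    }


# Two time units (days) of made-up traffic for one user.
day1 = [CDR(60.0, False), CDR(120.0, True)]
day2 = [CDR(180.0, False)]
summary = summarize([day1, day2])
```

The same function could produce a signature by passing it a longer window (for instance seven daily lists), which mirrors the statement that signature and summary differ only in the time window they cover.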
2.3 Anomaly Detection
Given a set of CDRs, C, we would like to know if during the corresponding
period of time the user deviates from their typical behavior. First of all, there is
the need to process such information. The processing of C, PC, basically consists
in extracting from C the set of feature variables described in table 1. Once this
step is performed, we have two vectors of feature variables, S (signature) and PC (summary),
available for comparison. For the determination of the distance between these
two vectors, the usual distance functions like the Euclidean distance cannot be
applied, since the vectors contain complex variables. Besides, we would like to
look at the problem from a probabilistic point of view, i.e. the distance measure
corresponds to some probabilistic value of PC being different from S.
Since the features in the signature have different types, each variable has to
be evaluated by a distinct sub-function. Thus, the dist function is composed of
several sub-functions: dist = φ(f1, f2, ..., fn).
Next, we present through a semi-formal example the details of our distance
function. Consider a simplification of a signature S = {(µa, σa); µb; µc; (µd, σd)},
where the first and the last feature variables are complex and the second and
the third are simple variables. Let PC = {(µ′a, σ′a); µ′b; µ′c; (µ′d, σ′d)} be a vector of
variables from a period ∆t already processed. The proposed distance function
can be presented as:
$$\mathrm{dist}(S, C) = \alpha_1 \cdot f_1(S.a, C.a) + \alpha_2 \cdot f_2(S.b, C.b) + \alpha_3 \cdot f_3(S.c, C.c) + \alpha_4 \cdot f_4(S.d, C.d) \tag{1}$$
Formula 1 is a linear combination of the distances observed in each
of the feature variables. The constants αi are weighting factors for each of
the variables and express the importance given to each feature when
determining an anomalous deviation. These values are provided by the fraud analyst.
Since the analyst may want to observe different fraud situations, different distance
functions can be provided by setting the weighting factors αi to different values.
The overall distance function is then defined as in formula 2:
$$\mathrm{Dist}(S, C) = \max\{\mathrm{dist}_1(S, C), \mathrm{dist}_2(S, C), \ldots, \mathrm{dist}_m(S, C)\} \tag{2}$$
The main point of using a distance function is that if the distance between
S and C exceeds a certain threshold ξ defined by the analyst, i.e. Dist(S, C) >
ξ, then an alarm should be raised for further inspection. Otherwise, the user is
considered to be within their expected behavior.
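Formulas 1 and 2 and the threshold test can be sketched in a few lines of Python. The function names, the example distances and the two weight profiles are hypothetical; choosing weights that sum to one (so that the score stays in [0, 1] when every fi does) is an assumption of this sketch, not something the text requires.

```python
def dist(weights, distances):
    """Formula 1: linear combination of the per-feature distances."""
    if len(weights) != len(distances):
        raise ValueError("one weight per feature distance is required")
    return sum(a * f for a, f in zip(weights, distances))


def overall_dist(weight_profiles, distances):
    """Formula 2: maximum of the distances obtained with each weight profile."""
    return max(dist(w, distances) for w in weight_profiles)


def is_alarm(weight_profiles, distances, threshold):
    """An alarm is raised when Dist(S, C) exceeds the analyst threshold xi."""
    return overall_dist(weight_profiles, distances) > threshold


# Hypothetical per-feature distances f_1..f_4, each already in [0, 1].
feature_distances = [0.9, 0.1, 0.2, 0.4]
profiles = [
    [0.25, 0.25, 0.25, 0.25],  # every feature equally important
    [0.70, 0.10, 0.10, 0.10],  # profile focused on the first feature
]
score = overall_dist(profiles, feature_distances)
```

Taking the maximum means that a single weight profile that reacts strongly is enough to raise the alarm, which matches the idea of letting the analyst watch several fraud situations at once.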
2.4 Distance Between Feature Variables
From a statistical point of view, it is frequently acceptable to assume that many random
variables have probability distributions that can be appropriately described by a
normal distribution, once µ and σ are specified [16]. The normal distribution
gives us a reasonable approximation to many variables that occur in
real world situations. Accordingly, we suggest an adaptation of the normal
distribution function to measure the distance between complex feature variables.
For a given variable X, where X ∼ N(µ, σ), the Z-score function provides the
likelihood of X taking the value x, P(X = x) = P(Z = (x − µ)/σ). In our particular
case, we want to measure, for a feature variable X taking a value x, the distance
from the typical behavior, i.e. the average value. The Z-score function provides
a larger likelihood as the value of X tends to µ, being maximal if X = µ.
Since we are measuring a distance, we want our distance function to return
a value that decreases as the likelihood of X taking the value x increases.
For that, we only need to subtract our likelihood value P from the accumulated
likelihood, which is one: fNormal = 1 − P. With this formula, values of
X distant from µ have a larger value of fNormal. Considering the example of the last
section, f1 and f4 correspond to fNormal, where µa and σa are the parameters
that describe the normal distribution of the feature a and µ′a is the value being
evaluated.³
To measure the distance between simple feature variables we can use
a simple distance or any other distribution-based measure. We propose the
use of the Poisson non-cumulative distribution [16, 17, 22]. This function has its
most important application in counting the number of events that occur in
a certain time interval or spatial region, when the events are independent from
each other. The probability mass function of a Poisson variable is given by
formula 3. The constant e corresponds to Napier's number, λ is the expected
value, which in our case corresponds to the average value described by the signature,
and k corresponds to the observed value.
³ The value of σ′a will only be considered for updating of the signature.
$$P(N = k) = \frac{e^{-\lambda} \lambda^{k}}{k!} \tag{3}$$
The probabilistic distance between the observed value k and
the expected value λ of a variable X is given by:
$$f_{Poisson} = \mathrm{dist}(\lambda, k) = \frac{|P(X = \lambda) - P(X = k)|}{N}$$
where N is the normalizing factor. Since the Poisson function is
non-symmetric and only defined for non-negative values, N depends on the side of λ
on which the observed value falls: N = P(X = λ) − P(X = ∞) ≃ P(X = λ) if k > λ,
and N = P(X = λ) − P(X = 0) if 0 ≤ k < λ.
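The two per-feature distances of this section can be sketched as follows. Several points are assumptions of the sketch rather than statements of the method: P(X = x) for the normal case is read as the normal density rescaled so that it equals one at the mean, i.e. exp(−z²/2); the Poisson probability is extended to non-integer λ and k through the gamma function; and when the normalizing factor N is not positive (which can happen for small λ) the distance falls back to zero. The function names are hypothetical.

```python
import math


def f_normal(mu, sigma, x):
    """Distance 1 - P for a complex variable, with P = exp(-z^2 / 2) (assumed)."""
    if sigma <= 0:
        # Degenerate signature with no recorded spread: only an exact match is typical.
        return 0.0 if x == mu else 1.0
    z = (x - mu) / sigma
    return 1.0 - math.exp(-0.5 * z * z)


def poisson_pmf(lam, k):
    """e^-lam * lam^k / k!, using lgamma so that non-integer k is accepted."""
    if lam <= 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def f_poisson(lam, k):
    """|P(X = lam) - P(X = k)| / N for a simple variable."""
    p_lam = poisson_pmf(lam, lam)
    # N ~= P(lam) when k is above the mean, P(lam) - P(0) when it is below.
    n = p_lam if k > lam else p_lam - poisson_pmf(lam, 0)
    if n <= 0:
        # Assumed fallback: the normalizer is not usable for very small lam.
        return 0.0
    return abs(p_lam - poisson_pmf(lam, k)) / n
```

With these definitions an observation equal to the signature average gives a distance of zero in both cases, and the distance approaches one as the observation moves away from the average, for example `f_poisson(4, 0)` or `f_normal(10, 2, 20)`.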
2.5 Signature Updating
Before describing how the update of a signature is performed, we should say that
the initialization of a signature is a straightforward process. The initial signature
S0 corresponds to a summary for the period of the initial time window w0. As
mentioned in section 2.1, the update can be performed in a
time oriented or action oriented mode. The mode chosen for this work was the
former. In either case, it is necessary to weight the impact of the new action
or set of actions on the new signature values. Following the ideas of [5, 2], the
update of a signature St at the instant t + 1, St+1, through a set of processed
CDRs PC is given by formula 4.
$$S_{t+1} = \beta \cdot S_t + (1 - \beta) \cdot P_C \tag{4}$$
The constant β controls the weight of the new actions C in the values of the
new signature: the previous signature is weighted by β and the new summary by
1 − β. Depending on the size of the time window w, this constant can be
adjusted. In [5] it is pointed out that a daily update with a value of β = 0.85
accounts for the information of the last 30 days. With a value of β = 0.5 only
the last 7 days are considered in the signature values. This constant can always
be tuned by the fraud analyst.
In our system, in contrast to the system in [5], the value of the signature
is always updated. If Dist(St, C) ≤ ξ then the user is considered to have a
normal behavior. If Dist(St, C) > ξ then an alarm is triggered, but the signature
continues to be updated. The reason for this is that the alarm still
needs to pass through the analysis of the company's fraud expert. It may happen
that the analyst considers it a false alarm and the user behavior is
within some expected behavior. The continuous update of that user signature
avoids the loss of information that was gathered between the moment when the
alarm was triggered and the moment the analyst gives the verdict.
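A minimal sketch of the update in formula 4 is given below. The dictionary layout and the function names are hypothetical, and blending the average and the standard deviation of a complex variable separately is an assumption, since the text does not say how the two components are combined.

```python
def update_value(old, new, beta):
    """Formula 4 for one scalar: beta * S_t + (1 - beta) * P_C."""
    return beta * old + (1.0 - beta) * new


def update_signature(signature, summary, beta):
    """Apply formula 4 to every feature variable.

    Simple variables are floats; complex variables are (mu, sigma) tuples
    whose components are blended separately (an assumption of this sketch).
    """
    updated = {}
    for name, old in signature.items():
        new = summary[name]
        if isinstance(old, tuple):
            updated[name] = tuple(update_value(o, n, beta) for o, n in zip(old, new))
        else:
            updated[name] = update_value(old, new, beta)
    return updated


def weight_after(beta, updates):
    """Weight still carried by the signature as it stood `updates` updates ago."""
    return beta ** updates


signature = {"duration": (100.0, 20.0), "international_calls": 2.0}
summary = {"duration": (200.0, 40.0), "international_calls": 0.0}
new_signature = update_signature(signature, summary, beta=0.5)
```

Under this reading, the contribution of the old signature shrinks by a factor β at every update: 0.85^30 ≈ 0.0076 and 0.5^7 ≈ 0.0078, so in both cases less than 1% of the weight remains after 30 and 7 daily updates respectively, which agrees with the figures quoted from [5].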
3 Model Behavior
In the next two sections we describe how the signature and summary information
is managed through the entire system.
input : SummList (list of new summaries)
/* Compare each Summary against the respective Signature and detect
   anomalous behaviors */
 1 foreach Summ in SummList do
 2     userId = getUserId(Summ);
 3     signId = getSignId(userId);
 4     if signatureIsActive(signId) == TRUE then
 5         w = createWindowTimeFrame();
 6         Sign = loadSignature(signId);
 7         if Dist(Sign, Summ) ≤ ξ then
 8             updateSignature(Sign, Summ, w);
 9         else
10             updateSignature(Sign, Summ, w);
11             triggerAlarm(userId);
12             clientToQuarantine(userId);
13         end
14     end
15 end
Algorithm 1: Pseudo-algorithm that performs the anomaly detection by comparing
the new incoming summaries with the respective signatures.
3.1 Pseudo-Algorithm
The system operates in “batch” mode, i.e. whenever new
summaries are available, for instance at the end of the day, the list of summaries is traversed and each one is compared against the respective signature. In
algorithm 1, the foreach cycle between lines 1 and 15 processes all the incoming
summaries. Lines 2 and 3 get the respective user and signature identification.
Next, line 4 verifies whether the signature is in an active state, which corresponds to
an up-to-date signature. Line 5 creates a reference for the window frame that
is being analyzed and in line 6 all the information relative to the user signature is
fetched from the database. Lines 7 to 13 test the distance between the user
signature and the summary. If an alarm is raised the user becomes part of a “Black
List”, which we call quarantine. In either case the signature is always updated
(lines 8 or 10).
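The batch loop of algorithm 1 can be sketched in Python as below. The `Signature` and `Detector` classes are hypothetical, an in-memory dictionary stands in for the database, the window frame w of line 5 is left out, and the distance and update functions in the usage example are toy stand-ins rather than the functions of sections 2.3 to 2.5.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Signature:
    values: dict
    active: bool = True


@dataclass
class Detector:
    signatures: dict                        # user id -> Signature (stands in for the database)
    dist: Callable[[dict, dict], float]     # Dist(signature values, summary values)
    update: Callable[[dict, dict], dict]    # returns the updated signature values
    threshold: float                        # xi, set by the analyst
    alarms: list = field(default_factory=list)
    quarantine: set = field(default_factory=set)

    def process(self, summaries):
        """Batch pass over (user_id, summary values) pairs, as in algorithm 1."""
        for user_id, summ in summaries:
            sign = self.signatures.get(user_id)
            if sign is None or not sign.active:
                continue  # only active signatures are compared
            score = self.dist(sign.values, summ)
            # The signature is updated whether or not an alarm is raised.
            sign.values = self.update(sign.values, summ)
            if score > self.threshold:
                self.alarms.append((user_id, score))
                self.quarantine.add(user_id)


# Usage with a toy one-feature distance and a beta = 0.85 update.
detector = Detector(
    signatures={"u1": Signature({"calls": 10.0}), "u2": Signature({"calls": 10.0})},
    dist=lambda s, p: abs(s["calls"] - p["calls"]) / 10.0,
    update=lambda s, p: {"calls": 0.85 * s["calls"] + 0.15 * p["calls"]},
    threshold=0.5,
)
detector.process([("u1", {"calls": 11.0}), ("u2", {"calls": 30.0})])
```

Computing the score before the update keeps the comparison against the signature as it stood before the new summary arrived, which is the order used in lines 7 to 10 of the pseudo-algorithm.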
3.2 Detecting Anomalies
The anomaly detection procedure consists of a process of several steps that is
represented in figure 2. The process starts with the loading step, which is used to
import the information into the database of the system. This information refers
to the signature and summary information of each user. The signatures are
imported once, when the system is started. All the signatures of a user are kept
through time. Such information will be used for later analysis. A signature
may have one of two statuses, “Active” or “Expired”. For each client only one signature
can have the Active state, and it is the most up-to-date one. The processing
step corresponds to the algorithm described in section 3.1, where the Active
signatures are used for the anomaly detection. In this step, the active signature
is updated (see section 2.5) and marked as active. If an alarm is raised, the client
is put on the quarantine list. This corresponds to the triggering alarm step,
which is described in sections 2.3 and 2.4. Finally, all the raised
alarms have to pass through the analyst’s verification in order to determine whether
the alarm corresponds to a fraud scenario.
[State chart “Detecting Anomalies”: steps loading, processing, triggering alarm and verifying client; states Signature and Client to Quarantine; guards “Signature has expired”, “Signature is Actual”, “Client is fraud” and “Client is no fraud”.]
Fig. 2. State Chart of the Signature flow.
4 Evaluating Alarms
The system interface is inspired by the ideas of a dashboard system, which shows
a complete set of information to facilitate the evaluation process. The analyst
has several tools to investigate the alarms. Here, we give a brief overview of
three of the proposed tools.
An alarm corresponds to a situation where the distance between the signature and a summary has exceeded the threshold. It is interesting to analyze which
feature variables had the greatest impact in causing an alarm.
This impact can be calculated simply as the ratio of the contribution of each
feature variable (fv) to the overall distance (formula 1). Figure 3 (a) shows a
pie chart for the distribution of the impact of seven feature variables.
In order to have a more general overview of the impact of each feature variable, the TOP-K alarms associated with a given user are grouped, and the aggregation
([sum(fv1), sum(fv2), ..., sum(fvn)]) of these impacts is calculated. Figure 3
(b) shows the aggregation of the impacts for the TOP 5 alarms of user A.
The type of information presented in figure 3 (a) and (b) is very important
to the analyst because it supports the understanding of the user behavior and
points toward the threshold values that should be used to capture the alarms.
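The impact calculation described above can be sketched as follows. The function names are hypothetical, and ranking the TOP-K alarms by their distance score is an assumption, since the ranking criterion is not stated.

```python
def impacts(weights, distances):
    """Share of each weighted feature distance in the overall dist (formula 1)."""
    contributions = [a * f for a, f in zip(weights, distances)]
    total = sum(contributions)
    if total == 0:
        return [0.0] * len(contributions)
    return [c / total for c in contributions]


def top_k_aggregate(alarms, k):
    """Sum the per-feature impacts over the k alarms with the highest score.

    `alarms` is a list of (score, impacts) pairs belonging to one user.
    """
    top = sorted(alarms, key=lambda alarm: alarm[0], reverse=True)[:k]
    if not top:
        return []
    return [sum(column) for column in zip(*(imp for _, imp in top))]


# One alarm with two features: the first feature explains three quarters of it.
shares = impacts([0.5, 0.5], [0.6, 0.2])

# Made-up alarm history of one user, aggregated over the two strongest alarms.
history = [(0.9, [0.75, 0.25]), (0.7, [0.5, 0.5]), (0.2, [0.0, 1.0])]
aggregated = top_k_aggregate(history, k=2)
```

The values in `shares` are what a pie chart like the one in figure 3 (a) would display, while `aggregated` corresponds to the grouped view of figure 3 (b).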
Fig. 3. (a) Impact of each feature variable; (b) Aggregation of the impact features of
a given client.
In order to observe the behavior of a given client during a certain period of
time the analyst can make use of a time series chart. In this chart, all the
distances calculated for the selected time window can be used to study whether
the client shows any particular trend in their behavior. Figures 4 (a) and (b) show
two examples, for two different users during the period of one month. Note that
at the points where the distance (also called score for output reasons) exceeded
the threshold (dashed line) an alarm was raised. Two different threshold values
were used for illustration purposes.
Fig. 4. Graph of the distance values(score) of two users in the time interval of one
month.
5 Related Work
Fraud detection can be done at two levels, call or behavior, and with two different
approaches, user profile or signature based. Most of the techniques use the CDR
data to create a user profile and to detect anomalies based on these profiles.
The work in [9, 8] is an example of this. The authors mined large amounts of CDRs in
order to find patterns and scenarios of normal usage and typical fraud situations.
These scenarios were then used to configure monitors that “observe” the user
behavior in relation to that type of fraud. These monitors are then combined
in a neural network, which raises an alarm when sufficient support for fraud
exists. This type of system can be classified as a rule based approach, since
it relies on the triggering of certain rules due to abnormal usage. The system
presented in [18] is also an example of a rule based system that works at the
behavior data level. But as stated in [11], rule based systems have the drawback of
requiring expensive management of rules. Rules need to be precise (to avoid false
positive alarms) and constantly evolving (to detect new scenarios), which results in
very time-consuming programming.
The most common and most successful methods [21] for fraud analysis are
signature based. These methods detect fraud through deviation detection, by
comparing the recent activity with the user behavior data, which is expressed
through the user signature. In this context, our work adapts and extends the
work of [5] by reformulating the notion of signature and by introducing the notion
of statistical-based distances to detect anomalies. Furthermore, we reduce the
computation cost by using simple statistical functions, avoiding the processing of costly
histograms. A clear problem with a histogram approach is that discretization
intervals or buckets must be chosen, and what is “right” for one customer may
be “wrong” for another.
Other approaches have also been widely applied to fraud analysis, for
example neural networks [15, 19]. In [20] the authors describe neural networks,
mixture models, and Bayesian networks for telecommunication fraud detection, derived from call records stored for billing. Another applied technique is link analysis. Here the client's links (called numbers) are updated over time, establishing a
graph of so-called “communities of interest” [4] that can easily reveal networks of
fraudsters. These methods are based on the observation that fraudsters seldom
change their calling habits, but are often closely linked to other fraudsters [14].
In [10] several methodologies are presented for outlier detection. Lately, there
have been some efforts to explore anomaly metadata [12], systematic data
selection for concept-drifting data streams [7] and belief state approaches to signaling alarms [6].
6 Conclusions and Future Work
Fraud detection for mobile telecommunications is a relatively recent area of research. Due to its characteristics, this type of fraud requires (nearly) real-time
and individualized customer analysis. The literature in this area points out that customer signatures provide a means to describe the current customer behavior and
can be used to efficiently detect fraud situations. In this work, we propose
an anomaly detection system to support mobile telecommunications fraud detection. Signatures form the basis of the anomaly detection mechanism. We have
adapted and extended the concept of signature in order to accurately capture
the statistical information that describes the user behavior and to increase the
precision of the anomaly detection. Thus, we provide a new definition of signature along with the respective statistical tools for its analysis. We also provide
the computational details for the management of the signatures. It is expected
that the proposed system will have a critical impact on the fraud prevention and
detection procedures of mobile telecommunications providers. The system
constantly adapts to the user behavior patterns. Deviations from these patterns
result in an indication to the fraud analyst that an anomalous and possibly
fraudulent situation has occurred.
At the time of writing, the system implementation has been finished and the work is in its experimental stage. Currently, we are studying
parameter tuning, scalability issues and the analyst interaction with
the system. We have also been investigating the application of the signatures
for user segmentation. We have applied clustering techniques in order to find
groups of related users. We believe that the analysis of cluster migrations could
also shed light on fraud situations.
7 Acknowledgments
This work was financed by Portugal Telecom Inovação, S.A. under a service
acquisition and knowledge transfer protocol established with the University of
Minho. The authors gratefully acknowledge Francisco Paz, João Lopes, Filipe
Martins, Eduardo Taborda, and João Pias for their fruitful support of this work.
References
1. Richard J. Bolton and David J. Hand. Statistical fraud detection: A
review. Statistical Science, 17(3):235–255, January 2002.
2. P. Burge, J. Shawe-Taylor, Y. Moreau, H. Verrelst, C. Stoermann, and P. Gosset.
Fraud detection and management in mobile telecommunications networks. In Proceedings of the 2nd IEEE European Conference on Security and Detection, volume
437, pages 91–96, London, April 1997. IEEE.
3. M. Cahill, D. Lambert, J. Pinheiro, and D. Sun. Handbook of massive data sets,
chapter Detecting fraud in the real world, pages 911–929. Kluwer Academic Publishers, Norwell, MA, USA, 2002.
4. C. Cortes, D. Pregibon, and C. Volinsky. Communities of interest. Intelligent
Data Analysis, 6(3):211–219, 2002.
5. Corinna Cortes and Daryl Pregibon. Signature-based methods for data streams.
Data Mining and Knowledge Discovery, (5):167–182, 2001.
6. Kaustav Das, Andrew Moore, and Jeff Schneider. Belief state approaches to signaling alarms in surveillance systems. In Proceedings of the tenth ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 539–544,
New York, NY, USA, 2004. ACM Press.
7. Wei Fan. Systematic data selection to mine concept-drifting data streams. In
Proceedings of the tenth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 128–137, New York, NY, USA, 2004. ACM Press.
8. Tom Fawcett and Foster Provost. Combining data mining and machine learning
for effective user profiling. In Simoudis, Han, and Fayyad, editors, Proceedings on
the Second International Conference on Knowledge Discovery and Data Mining,
pages 8–13, Menlo Park, CA, 1996. AAAI Press.
9. Tom Fawcett and Foster Provost. Adaptive fraud detection. Data Mining and
Knowledge Discovery, pages 1–28, 1997.
10. Victoria Hodge and Jim Austin. A survey of outlier detection methodologies.
Artificial Intelligence Review, 22(2):85–126, 2004.
11. Y. Kou, T. Lu, S. Sirwongwattana, and Y. Huang. Survey of fraud detection
techniques. In Proceedings of 2004 IEEE International Conference on Networking,
Sensing and Control, Taipei, Taiwan, March 2004. IEEE.
12. Tysen Leckie and Alec Yasinsac. Metadata for anomaly-based security protocol
attack deduction. IEEE Trans. Knowl. Data Eng., 16(9):1157–1168, 2004.
13. T.F. Lunt. A survey of intrusion detection techniques. Computer and Security,
(53):405–418, 1999.
14. John McCarthy. Phenomenal data mining. Commun. ACM, 43(8):75–79, 2000.
15. Yves Moreau, Herman Verrelst, and Joos Vandewalle. Detection of mobile phone
fraud using supervised neural networks: A first prototype. In ICANN ’97: Proceedings of the 7th International Conference on Artificial Neural Networks, pages
1065–1070, London, UK, 1997. Springer-Verlag.
16. Myers and Myers. Probability and Statistics for Engineers and Scientists. Prentice
Hall, 6th edition.
17. Antonio Pedrosa and Silvio Gama. Introdução Computacional a Probabilidade e
Estatistica. Porto Editora, 2004.
18. Saharon Rosset, Uzi Murad, Einat Neumann, Yizhak Idan, and Gadi Pinkas. Discovery of fraud rules for telecommunications challenges and solutions. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 409–413, New York, NY, USA, 1999. ACM Press.
19. J. Shawe-Taylor, K. Howker, P. Gosset, M. Hyland, H. Verrelst, Y. Moreau, C. Stoermann, and P. Burge. In Business Applications of Neural Networks, chapter Novel
techniques for profiling and fraud detection in mobile telecommunications, pages
113–139. Singapore: World Scientific, 2000.
20. Michiaki Taniguchi, Michael Haft, Jaakko Hollmen, and Volker Tresp. Fraud detection in communications networks using neural and probabilistic methods. In
Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP’98), number 2, pages 1241–1244, 1998.
21. Gary M. Weiss. Data Mining in Telecommunications. Kluwer, 2004.
22. Eric W. Weisstein. Poisson distribution. From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/PoissonDistribution.html, 2006.