International Journal of Conceptions on Computing and Information Technology
Vol. 3, Issue. 1, April’ 2015; ISSN: 2345 - 9808
Outlier Mining in Data Streams Using Massive Online Analysis Framework

Prof. Dr. P K Srimani
Former Director, R & D
Bangalore University
Bangalore, India
[email protected]

Malini M Patil
Assistant Professor, Dept. of ISE,
J S S Academy of Technical Education,
Bangalore, India
[email protected]
Abstract- Outlier mining for data streams differs substantially
from outlier mining for traditional datasets. An outlier is a data
point that deviates significantly from the expected behavior; what
counts as abnormal is application dependent. A data mining
technique can learn a pattern from the dataset and then compare
every data point against that pattern to detect outliers. Advances
in technology have led to a large flow of data in digital form.
Data generated by applications such as sensor networks, web-click
monitoring and network traffic monitoring are huge and have large
data distributions. Such data are referred to as data streams.
Outlier mining is different for data streams because the entire
dataset is never available at once, which makes outlier detection
a challenging research issue. The present work aims at mining
outliers from data streams with distance based algorithms in the
Massive Online Analysis (MOA) framework. The algorithms used are
the simple continuous outlier detection (Simple COD) algorithm and
the micro cluster based continuous outlier detection (MCOD)
algorithm. Both algorithms are run on streams of different sizes
(5000, 10000, 15000, 20000, 25000 and 30000 instances), and a
comparative study of their memory usage and processing time is
presented.
Keywords- Data streams, Simple COD, MCOD, Massive Online Analysis, Outliers, Inliers
I. INTRODUCTION
Outlier detection (mining) is also termed anomaly detection. It is
one of the important tasks of data mining [1]. The task aims at
discovering outliers, which are specific patterns that show
significantly unexpected behavior. Applications in which outliers
are important include fraud detection, network monitoring systems,
sensor networks and many more. Outliers may appear in a dataset
for numerous reasons, such as malicious activity, instrumental
error, setup error, changes of environment, human error or
catastrophe. Regardless of the reason, outliers may be interesting
and/or important to the user because they differ from normal data
points. Some people regard outliers as problems and others regard
them as interesting items, but in any case they are unavoidable.
They are also known by different names in the data mining and
statistics literature: abnormalities, discordants, deviants or
anomalies. In [2] the author defines an outlier as an observation
which deviates so much from the other observations as to arouse
suspicions that it was generated by a different mechanism. A
natural question is why outliers should be detected at all.
Network monitoring gives one answer: data is collected from
heterogeneous sources, and a malicious attack may cause the data
to show unusual behavior. Outlier analysis is necessary to detect
such behavior.
Another important area of outlier analysis is patient disease
diagnosis. Patients are advised to undergo different types of
diagnostic procedures such as MRI, ECG and CT scans, which are
conducted with different devices. The patient is diagnosed on the
basis of the reports of such tests, and unusual patterns in the
data indicate different types of disease conditions. Similar
examples can be quoted from spatial data and cyber data. Outlier
mining in data streams [3,4,5] is a very challenging task because
of their ubiquitous nature. A technique should address many
research issues related to handling data streams: execution time,
uncertainty, concept drift, arrival rate, dimensionality, memory
usage, etc. The present work aims at performing outlier mining
with distance based algorithms in the Massive Online Analysis
framework.
Outliers can be classified into three major categories as
follows.
Type I Outliers- An isolated individual data point in a dataset
is termed a Type I outlier. By definition these are the simplest
type and are very easy to identify. Intuitively, they are far from
the other data points in the dataset in terms of attribute values.
Type II Outliers- A data point that is isolated with respect to
the other data points in the same context is called a Type II
outlier.
Type III Outliers- A group of data points that appears as an
outlier with respect to the entire dataset is termed a Type III
outlier. No data point in the group is an outlier with respect to
the other points in the group, but as a group they are outliers.
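The notion of a Type I outlier can be illustrated with a small distance based test. The sketch below is only an illustration and is not taken from the paper: it assumes the common distance based convention in which a point is an outlier when fewer than k other points lie within a radius R. The function name, the sample data and the values of R and k are hypothetical.

```python
import math

def distance_outliers(points, R, k):
    """Return the points that have fewer than k other points within distance R.

    R and k are assumed parameters of a distance based outlier test.
    """
    outliers = []
    for i, p in enumerate(points):
        # Count the other points that fall inside the radius-R ball around p.
        neighbours = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= R
        )
        if neighbours < k:
            outliers.append(p)
    return outliers

# Four points close together and one isolated point.
data = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 8.0)]
isolated = distance_outliers(data, R=1.0, k=2)
print(isolated)
```

This brute-force test compares every pair of points, so it only suits a small static dataset; it does not address the streaming setting discussed in the rest of the paper.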
The rest of the paper is organized as follows: Section II covers
related work; methods and models are discussed in Section III;
experiments and results are discussed in Section IV; conclusion
and future work are discussed at the end.
II. RELATED WORK
In the recent past, outlier mining has been considered a very
challenging research area. Outlier detection for data streams is
a new area of research compared to the long history of outlier
detection in statistical data [14,15]. Distance based outlier
mining is proposed in [6]. Density based clustering over an
evolving stream with noise is proposed in [7]. A novel method for
outlier queries in data streams using a distance based approach
is proposed in [8]. Algorithms and methodologies for mining
distance based outliers are proposed in [9]. Extensive work on
continuous monitoring of distance based outliers over data streams
is presented in [10]. State-of-the-art algorithms for outlier
detection such as COD and MCOD are proposed in [11]. A supervised
approach to outlier mining can be found in [12], and an
unsupervised approach in [13].
III. METHODS AND MODELS
This section describes the framework used for outlier mining, the
algorithms used to detect the outliers, the data stream generator
and the configuration setup.
A. Massive Online Analysis Framework (MOA)
The Massive Online Analysis (MOA) framework [16,17,18,19,20,21]
is a software environment for implementing algorithms and running
experiments for online learning from evolving data streams. MOA
is designed to handle the challenging problems of data streams.
State-of-the-art algorithms are implemented in the framework and
scale to real world data sets. MOA consists of offline and online
algorithms for classification, clustering, outlier mining and
regression modeling, together with tools for evaluation. MOA is
thus an open source framework for handling massive, potentially
infinite, evolving data streams. It permits the evaluation of data
stream learning algorithms on large streams under explicit memory
limits. The outlier mining setup consists of the following steps:
i) select the stream, ii) select algorithm 1, iii) select
algorithm 2. The visualization window displays the behavior of the
selected algorithms for a specified number of instances. An
initial configuration model for outlier mining is shown in Fig. 1.
B. Algorithms Used in Outlier Detection
In the experimental setup, the algorithms [11] used to mine the
outliers are the simple continuous outlier detection (Simple COD)
algorithm and the micro cluster based continuous outlier detection
(MCOD) algorithm. The improved efficiency of COD (Continuous
Outlier Detection) stems from the adoption of an event-based
approach. Instead of checking each object continuously, the
algorithm computes the next time point in the future at which,
due to object departures, an object may become an outlier, and
inspects the object only at that time point.
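The event-based idea can be sketched in a few lines. The code below is a simplified illustration and not the COD implementation of [11] or of MOA. It assumes a time-based sliding window in which an object that arrived at time t departs at t + window, and an object is an inlier while it has at least k neighbours in the window. The function name, the status labels and the parameters are hypothetical.

```python
def next_check_time(p_arrival, neighbour_arrivals, k, window):
    """Return (status, time) for an object p, given the arrival times of
    its current neighbours.

    Neighbours that arrived after p depart after p does, so they can never
    cause p to lose a neighbour while p is still in the window.
    """
    succeeding = [t for t in neighbour_arrivals if t > p_arrival]
    if len(succeeding) >= k:
        # Enough long-lived neighbours: p never needs to be inspected.
        return ("safe inlier", None)

    # Neighbours that arrived before p, most recent first.
    preceding = sorted(
        (t for t in neighbour_arrivals if t <= p_arrival), reverse=True
    )
    needed = k - len(succeeding)
    if len(preceding) < needed:
        # Fewer than k neighbours in total: p is an outlier right now.
        return ("outlier", None)

    # p stays an inlier until the needed-th most recent preceding
    # neighbour departs; schedule a single inspection at that time.
    return ("inlier", preceding[needed - 1] + window)

status, check_at = next_check_time(10, [3, 5, 8, 12], k=3, window=10)
print(status, check_at)
```

In the usage line, the object has one succeeding neighbour and needs two of its preceding neighbours, so the inspection is scheduled for the departure of the neighbour that arrived at time 5. New arrivals can only add neighbours, so an inspection at the scheduled time is never too late in this simplified model.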
MCOD [11] (Micro-cluster-based Continuous Outlier Detection)
builds on top of COD and employs the same event queue. Its
distinctive characteristic is that it mitigates the need to
evaluate range queries for each new object with respect to all
other active objects. The solution is based on the concept of
evolving micro-clusters that correspond to regions containing
inliers exclusively. The range queries for each new object are
then performed with respect to the (fewer) micro-cluster centres
instead of the preceding active objects. On realistic data with
few outliers and dense regions, MCOD exhibits the best
performance. Both COD and MCOD have been implemented in the
extended MOA.
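The saving that micro-clusters provide can be illustrated as follows. This is a sketch under stated assumptions, not the MCOD implementation: it assumes each micro-cluster has radius R/2 and holds at least k + 1 objects, so that any object within R/2 of a centre is within R of every member and is therefore an inlier. The function name and the fallback to a full range query over all active objects are simplifications introduced here.

```python
import math

def classify_new_object(point, cluster_centres, active_objects, R, k):
    """Classify a new object as 'inlier' or 'outlier' (illustrative only)."""
    # Cheap test first: compare against the (few) micro-cluster centres.
    for centre in cluster_centres:
        if math.dist(point, centre) <= R / 2:
            # Assumed invariant: the cluster holds at least k + 1 objects,
            # all within distance R of this point (triangle inequality).
            return "inlier"
    # Fallback: an ordinary range query over the active objects.
    neighbours = sum(1 for q in active_objects if math.dist(point, q) <= R)
    return "inlier" if neighbours >= k else "outlier"

# One micro-cluster around the origin with k + 1 = 4 members.
centres = [(0.0, 0.0)]
members = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1)]
near = classify_new_object((0.2, 0.2), centres, members, R=1.0, k=3)
far = classify_new_object((5.0, 5.0), centres, members, R=1.0, k=3)
print(near, far)
```

When most of the stream falls into dense regions, most new objects are resolved by the first loop, which is consistent with the observation that MCOD performs best on data with few outliers.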
C. Data Stream Generator Used in the Study
The RANDOMRBF generator generates a random radial basis function
(RBF) stream, introduced by [16]. This generator was devised to
offer an alternate complex concept type that is not
straightforward to approximate with a decision tree model. The
RBF generator works as follows. A fixed number of random
centroids is generated. Each centre has a random position, a
single standard deviation, a class label and a weight. New
examples are generated by selecting a centre at random, taking
weights into consideration so that centres with higher weight are
more likely to be chosen. A random direction is chosen to offset
the attribute values from the central point. The length of the
displacement is randomly drawn from a Gaussian distribution with
standard deviation determined by the chosen centroid. The chosen
centroid also determines the class label of the example. This
effectively creates a normally distributed hypersphere of examples
surrounding each central point with varying densities. Only
numeric attributes are generated.
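The generation procedure described above can be sketched as follows. This is a minimal illustration written for this description and is not the MOA generator code. The function names, the dictionary layout, the value ranges of the random quantities and the choice of five centroids, two attributes and two classes are assumptions.

```python
import math
import random

def make_centroids(rng, n_centroids, n_attributes, n_classes):
    """Create centroids with random position, std, class label and weight."""
    centroids = []
    for _ in range(n_centroids):
        centroids.append({
            "centre": [rng.random() for _ in range(n_attributes)],
            "std": rng.random(),
            "label": rng.randrange(n_classes),
            "weight": rng.random(),
        })
    return centroids

def next_example(rng, centroids):
    """Generate one (attributes, label) example."""
    # Centres with higher weight are more likely to be chosen.
    weights = [c["weight"] for c in centroids]
    chosen = rng.choices(centroids, weights=weights, k=1)[0]
    # Random direction: a Gaussian vector scaled to unit length.
    direction = [rng.gauss(0.0, 1.0) for _ in chosen["centre"]]
    norm = math.sqrt(sum(d * d for d in direction))
    # Displacement length drawn from a Gaussian with the centroid's std.
    length = rng.gauss(0.0, chosen["std"])
    attributes = [
        c + length * d / norm for c, d in zip(chosen["centre"], direction)
    ]
    return attributes, chosen["label"]

rng = random.Random(1)  # fixed seed so the stream is reproducible
centroids = make_centroids(rng, n_centroids=5, n_attributes=2, n_classes=2)
stream = [next_example(rng, centroids) for _ in range(5000)]
print(len(stream), stream[0])
```

Because the centroids are fixed once and only the examples are random, the stream can be made as long as required, which matches the use of stream sizes from 5000 to 30000 in the experiments.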
IV. EXPERIMENTS AND RESULTS
The experiments are conducted in the Massive Online Analysis
framework. The data stream used for the analysis is produced by
the RANDOMRBF generator. The stream sizes selected are 5000,
10000, 15000, 20000, 25000 and 30000 instances. The number of
clusters is 5. The algorithms used in the experiments are the
simple continuous outlier detection (SimpleCOD) algorithm and the
micro cluster based continuous outlier detection (MCOD) algorithm.
The statistics are tabulated in Table 1 and Table 2. Other results
are shown in the visualization window in Fig. 2 and Fig. 3 and are
self explanatory.
Fig. 1 Configuration of outlier Mining in MOA framework
Table 2. Statistics of MCOD algorithm
NO. OF       INLIER   OUTLIER   BOTH INLIER   MMU    TPT
INSTANCES    NODES    NODES     & OUTLIER     (MB)   (ms)
5000         4331     386       283           107    2.14
10000        8972     719       309           133    3.77
15000        13558    1105      337           191    6.3
20000        18148    1503      349           158    8.65
25000        22711    1929      360           194    10.69
30000        27315    2315      370           221    316.82
Fig. 2 Results of Outlier Mining in MOA framework
Fig. 3 Graph of Evaluation Measures Vs Instance Size for SimpleCOD algorithm
Fig. 3 Results of Outlier Mining in MOA framework
Table 1. Statistics of SimpleCOD algorithm
NO. OF       INLIER   OUTLIER   BOTH INLIER   MMU    TPT
INSTANCES    NODES    NODES     & OUTLIER     (MB)   (ms)
5000         4331     386       283           107    36.16
10000        8972     719       309           133    78.89
15000        13558    1105      337           191    127.52
20000        18148    1503      349           158    169.83
25000        22711    1929      360           194    243.13
30000        27315    2315      370           221    14.00
Fig. 4 Graph of Evaluation Measures Vs Instance Size for MCOD algorithm
V. CONCLUSION
The experiments are conducted in the Massive Online Analysis
framework. The data stream used for the analysis is produced by
the RANDOMRBF generator. The stream sizes selected are
5000, 10000, 15000, 20000, 25000 and 30000 respectively. The
number of clusters is 5. The algorithms used in the experiments
are the simple continuous outlier detection (SimpleCOD) algorithm
and the micro cluster based continuous outlier detection (MCOD)
algorithm. The results tabulated in Table 1 and Table 2 show that
the memory usage (MMU) of the two algorithms is the same for every
stream size. A drastic variation is observed in the total
processing time (TPT): MCOD takes less time to execute than
SimpleCOD for every stream size except 30000. The counts of inlier
and outlier nodes are the same for both algorithms. Finally, the
present work establishes that, apart from traditional data mining
techniques, outlier mining is also possible in data streams under
the Massive Online Analysis framework.
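The size of the difference in processing time can be worked out directly from the tabulated values. The short calculation below is an illustration added for the reader. It uses the TPT and outlier node values of Table 1 and Table 2 for stream sizes 5000 to 25000 and leaves out the 30000 case, where the ordering of the two algorithms reverses. The variable names are hypothetical.

```python
# Values copied from Table 1 (SimpleCOD) and Table 2 (MCOD).
sizes = [5000, 10000, 15000, 20000, 25000]
simplecod_tpt = [36.16, 78.89, 127.52, 169.83, 243.13]
mcod_tpt = [2.14, 3.77, 6.3, 8.65, 10.69]
outlier_nodes = [386, 719, 1105, 1503, 1929]

# How many times longer SimpleCOD runs than MCOD at each stream size.
ratios = [slow / fast for slow, fast in zip(simplecod_tpt, mcod_tpt)]
# Share of instances reported as outlier nodes, in percent.
shares = [100 * out / n for out, n in zip(outlier_nodes, sizes)]

for n, ratio, share in zip(sizes, ratios, shares):
    print(f"{n:>6}  TPT ratio = {ratio:5.1f}  outlier share = {share:4.1f}%")
```

For these five stream sizes, SimpleCOD takes between about 17 and 23 times as long as MCOD, while the share of outlier nodes stays between 7% and 8% of the instances.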
REFERENCES
[1] Han, J. and Kamber, M. (ed.), "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, San Francisco, CA, 2007.
[2] Hawkins, "Identification of Outliers," Chapman and Hall, 1980.
[3] Aggarwal, C.C. (Ed.), "Data Streams: Models and Algorithms," Series: Advances in Database Systems, Vol. 31, XVIII, 354 p., 2007, ebook, Springer, Berlin Heidelberg.
[4] Guha, S., Koudas, N.K. and Shim, K., "Data Streams and Histograms," Proceedings of the thirty-third annual ACM Symposium on Theory of Computing, 2001, pp. 471-475, ACM Press.
[5] Domingos, P. and Hulten, G., "Mining time-changing data streams," In KDD'00, Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 71-80, 2000, NY, USA, doi:10.1145/347090.347107, ACM Press.
[6] Ramaswamy Sridhar, Rastogi Rajeev, Shim Kyuseok, "Efficient algorithms for mining outliers from large data sets," Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, pp. 427-438, 2000.
[7] F. Cao, M. Ester, W. Qian, and A. Zhou, "Density-based clustering over an evolving data stream with noise," In SDM, 2006.
[8] F. Angiulli and F. Fassetti, "Distance-based outlier queries in data streams: the novel task and algorithms," Data Mining and Knowledge Discovery, 20(2):290–324, 2010.
[9] E. Knorr and R. Ng, "Algorithms for mining distance-based outliers in large data sets," In VLDB, 1998.
[10] M. Kontaki, A. Gounaris, A. N. Papadopoulos, K. Tsichlas, and Y. Manolopoulos, "Continuous monitoring of distance-based outliers over data streams," In ICDE, pages 135–146, 2011.
[11] Dimitrios Georgiadis, Maria Kontaki, Anastasios Gounaris, Apostolos Papadopoulos, Kostas Tsichlas and Yannis Manolopoulos, "Continuous Outlier Detection in Data Streams: An Extensible Framework and State-Of-The-Art Algorithms."
[12] N. Abe, B. Zadrozny and J. Langford, "Outlier Detection by Active Learning," SIGKDD, 2006.
[13] V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, pp. 1-58, July 2009.
[14] V. Hodge and J. Austin, "A Survey of Outlier Detection Methodologies," Artificial Intelligence Review, vol. 22, pp. 85-126, October 2004.
[15] V. Barnett and T. Lewis, "Outliers in Statistical Data," New York: John Wiley & Sons, Inc., 1994.
[16] Bifet, A., Frank, E., Holmes, G., Pfahringer, B., "Accurate Ensembles for Data Streams: Combining Restricted Hoeffding Trees Using Stacking," Proc. 2nd Asian Conference on Machine Learning, Tokyo, Journal of Machine Learning Research, pp. 225-240, 2010.
[17] Bifet, A., Kirkby, R., Kranen, P. and Reutemann, P., "Massive Online Analysis," Technical Manual, University of Waikato, Hamilton, New Zealand, 2013.
[18] Bifet, A. and Kirkby, R., "Data Stream Mining: A Practical Approach," Technical report, The University of Waikato, Hamilton, New Zealand.
[19] Bifet, A., Frank, E., Holmes, G., Pfahringer, B., "MOA: Massive Online Analysis," Journal of Machine Learning Research, pp. 1601-1604, 2011.
[20] Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R. and Gavaldà, R., "New ensemble methods for evolving data streams," Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 139-148, 2009, ACM.
[21] Bifet, A. and Gavaldà, R., "Adaptive learning from evolving data streams," Advances in Intelligent Data Analysis VIII, pp. 249-260, 2009, Springer, Berlin Heidelberg.