International Journal of Conceptions on Computing and Information Technology, Vol. 3, Issue 1, April 2015; ISSN: 2345 - 9808

Outlier Mining in Data Streams Using Massive Online Analysis Framework

Prof. Dr. P K Srimani, Former Director, R & D, Bangalore University, Bangalore, India, [email protected]
Malini M Patil, Assistant Professor, Dept. of ISE, J S S Academy of Technical Education, Bangalore, India, [email protected]

Abstract- Outlier mining for data streams is completely different from that for traditional datasets. An outlier is a data point that conforms to a defined notion of abnormal behavior, and that definition is application dependent. A data mining technique can learn the pattern from the dataset and then compare every data point against that pattern to detect outliers. Advances in technology have led to a large flow of data in digital form. Data generated by applications such as sensor networks, web-click monitoring and network traffic monitoring are huge and have large data distributions. Such data are referred to as data streams. Outlier mining is different for data streams because the entire dataset is never available, owing to their ubiquitous nature. In such cases outlier detection is a very challenging research issue. The present work aims at mining outliers from data streams with distance-based algorithms in the Massive Online Analysis (MOA) framework. The algorithms used to mine the outliers are the simple continuous outlier detection (Simple COD) algorithm and the micro-cluster-based continuous outlier detection (MCOD) algorithm. The two algorithms are compared on data sets of different sizes (5000, 10000, 15000, 20000, 25000, 30000). The comparative study shows that the two algorithms report the same memory usage and the same inlier and outlier counts, but differ markedly in total processing time.

Keywords- Data streams, Simple COD, MCOD, Massive Online Analysis, Outliers, Inliers

I. INTRODUCTION
Outlier detection (mining) is also termed anomaly detection.
It is one of the important tasks of data mining [1]. The task aims at discovering outliers, which are specific patterns that show significantly unexpected behavior. Applications in which outliers are important elements include fraud detection, network monitoring systems, sensor networks and many more. Outliers may appear in a dataset for numerous reasons, such as malicious activity, instrumental error, setup error, changes of environment, human error or catastrophe. Regardless of the reason, outliers may be interesting and/or important to the user because they differ from normal data points. Some people regard outliers as problems and others regard them as interesting items, but in any case they are unavoidable. In the data mining and statistics literature they are also called abnormalities, discordants, deviants or anomalies. Hawkins [2] defines an outlier as an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.

This raises the question of why outliers should be detected. The following examples answer it. In a network monitoring system, data is collected from heterogeneous sources. Because of a malicious attack the data may show unusual behavior, and outlier analysis is necessary to detect such behavior. Another important area of outlier analysis is patient disease diagnosis. Patients are advised to undergo different types of diagnostic procedures such as MRI, ECG and CT scan, which are conducted with different devices. The patient is diagnosed based on the reports of such tests, and unusual patterns in the data indicate different types of disease conditions. Similar examples can be quoted from spatial data and cyber data.

Outlier mining in data streams [3,4,5] is a very challenging task because of their ubiquitous nature.
A technique for this task must address several research issues in handling data streams: execution time, uncertainty, concept drift, arrival rate, dimensionality, memory usage, etc. The present work aims at performing outlier mining with distance-based algorithms in the massive online analysis framework.

Outliers can be classified into three major categories as follows.
Type I Outliers- An isolated individual data point in a dataset is termed a Type I outlier. By definition these are the simplest type and are very easy to identify. Intuitively, they are far from the other data points in the dataset in terms of attribute values.
Type II Outliers- A data point that is isolated with respect to the other data points in its context is called a Type II outlier.
Type III Outliers- A group of data points that appears as an outlier with respect to the entire dataset is termed a Type III outlier. No data point in the small subset is an outlier with respect to the other points in the subset, but as a group they are outliers.

The rest of the paper is organized as follows: section II covers related work; methods and models are discussed in section III; experiments and results are discussed in section IV; conclusion and future work are discussed at the end.

II. RELATED WORK
In the recent past, outlier mining has been considered a very challenging research area. Outlier detection for data streams is a new area of research compared to the long history of outlier detection in statistical data [14,15]. Distance-based outlier mining is proposed in [6]. Density-based clustering over an evolving stream with noise is proposed in [7]. A novel distance-based method for outlier queries in data streams is proposed in [8]. Algorithms and methodologies for mining distance-based outliers are proposed in [9].
Extensive work on continuous monitoring of distance-based outliers over data streams is presented in [10]. State-of-the-art algorithms for outlier detection, such as COD and MCOD, are proposed in [11]. A supervised approach to outlier mining can be found in [12], whereas unsupervised approaches can be found in [13].

III. METHODS AND MODELS
This section describes the framework used for outlier mining, the data stream generator, the configuration setup and the algorithms used to detect the outliers.

A. Massive Online Analysis Framework (MOA)
The Massive Online Analysis (MOA) framework [16,17,18,19,20,21] is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to handle the challenging problems of data streams. State-of-the-art algorithms are implemented in the framework and scale up to real-world data sets. MOA consists of offline and online algorithms for classification, clustering, outlier mining and regression modeling. It also provides tools for evaluation. Thus MOA is an open source framework for handling massive, potentially infinite, evolving data streams. MOA mainly permits the evaluation of data stream learning algorithms on large streams under explicit memory limits. The outlier mining setup consists of the following steps: i) select the stream, ii) select algorithm 1, iii) select algorithm 2. The visualization window displays the behavior of the selected algorithms for a specified number of instances. An initial configuration model for outlier mining is shown in Fig. 1.

B. Algorithms used in the Outlier detection
For the experimental setup, the algorithms [11] used to mine the outliers are the simple continuous outlier detection (Simple COD) algorithm and the micro-cluster-based continuous outlier detection (MCOD) algorithm.
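As background for both algorithms, a distance-based outlier in a stream is commonly defined over a sliding window: an object is an outlier if fewer than k other objects in the current window lie within distance R of it. The sketch below is a naive baseline under that definition, not the COD or MCOD implementation of [11] or of MOA. The function names and the parameters `window_size`, `radius` and `k` are illustrative assumptions; the paper does not state the values it used. The baseline re-checks every object on every arrival, which is the cost that the two algorithms are designed to reduce.

```python
import math
from collections import deque


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def window_outliers(window, radius, k):
    """Return the points in `window` that have fewer than `k` neighbours
    within `radius` (the point itself is not counted as a neighbour)."""
    outliers = []
    for i, p in enumerate(window):
        neighbours = sum(
            1
            for j, q in enumerate(window)
            if i != j and euclidean(p, q) <= radius
        )
        if neighbours < k:
            outliers.append(p)
    return outliers


def stream_outliers(stream, window_size, radius, k):
    """Naive baseline: after each arrival, slide a count-based window and
    re-evaluate every object in it. Yields (arrival_index, outliers)."""
    window = deque(maxlen=window_size)  # oldest objects expire automatically
    for t, point in enumerate(stream):
        window.append(point)
        yield t, window_outliers(window, radius, k)


if __name__ == "__main__":
    # Hypothetical toy stream: three nearby points and one distant point.
    toy_stream = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    for t, outliers in stream_outliers(toy_stream, window_size=4, radius=1.0, k=2):
        print(t, outliers)
```

Each arrival costs a number of distance computations quadratic in the window size here, so this form is only useful as a reference point for small windows.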
The improved efficiency of COD (Continuous Outlier Detection) stems from the adoption of an event-based approach. Instead of checking each object continuously, the algorithm computes the next time point in the future at which, due to object departures, an object may become an outlier, and inspects the object only at that time point. MCOD [11] (Micro-cluster-based Continuous Outlier Detection) builds on top of COD and employs the same event queue. Its distinctive characteristic is that it mitigates the need to evaluate range queries for each new object with respect to all other active objects. The solution is based on the concept of evolving micro-clusters that correspond to regions containing inliers exclusively. The range queries for each new object are then performed with respect to the (fewer) micro-cluster centres instead of the preceding active objects. In realistic data with few outliers and dense regions, MCOD exhibits the best performance. Both COD and MCOD have been implemented in the extended MOA.

C. Data stream generator used in the study
The RANDOMRBF generator generates a random radial basis function (RBF) stream, introduced in [16]. This generator was devised to offer an alternative complex concept type that is not straightforward to approximate with a decision tree model. The RBF generator works as follows. A fixed number of random centroids are generated. Each centre has a random position, a single standard deviation, a class label and a weight. New examples are generated by selecting a centre at random, taking the weights into consideration so that centres with higher weight are more likely to be chosen. A random direction is chosen to offset the attribute values from the central point. The length of the displacement is randomly drawn from a Gaussian distribution with standard deviation determined by the chosen centroid. The chosen centroid also determines the class label of the example.
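The generation procedure described above can be sketched as follows. This is a minimal illustration in Python, not MOA's own generator; the class name `RandomRBFStream`, its parameters, and the use of uniform random values in [0, 1) for positions, standard deviations and weights are assumptions made for the example.

```python
import math
import random


class RandomRBFStream:
    """Illustrative random RBF stream: weighted centroids, Gaussian offsets."""

    def __init__(self, n_centroids=5, n_attributes=2, n_classes=2, seed=1):
        self.rng = random.Random(seed)
        self.n_attributes = n_attributes
        # Each centroid has a random position, one standard deviation,
        # a class label and a weight (all drawn uniformly: an assumption).
        self.centroids = [
            {
                "centre": [self.rng.random() for _ in range(n_attributes)],
                "std_dev": self.rng.random(),
                "label": self.rng.randrange(n_classes),
                "weight": self.rng.random(),
            }
            for _ in range(n_centroids)
        ]

    def next_example(self):
        """Return (attributes, class_label) for one generated example."""
        # Pick a centroid; heavier centroids are more likely to be chosen.
        weights = [c["weight"] for c in self.centroids]
        centroid = self.rng.choices(self.centroids, weights=weights, k=1)[0]

        # Random direction: normalise a vector of independent Gaussian draws.
        direction = [self.rng.gauss(0.0, 1.0) for _ in range(self.n_attributes)]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0

        # Displacement length drawn from a Gaussian whose standard deviation
        # is set by the chosen centroid.
        length = self.rng.gauss(0.0, centroid["std_dev"])

        attributes = [
            c + length * d / norm
            for c, d in zip(centroid["centre"], direction)
        ]
        return attributes, centroid["label"]


if __name__ == "__main__":
    stream = RandomRBFStream(n_centroids=5, n_attributes=3, seed=42)
    for _ in range(3):
        print(stream.next_example())
```

Fixing the seed makes the stream reproducible, which is convenient when two detectors are to be compared on identical input.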
This generation procedure effectively creates a normally distributed hypersphere of examples surrounding each central point, with varying densities. Only numeric attributes are generated.

IV. EXPERIMENTS AND RESULTS
The experiments are conducted in the Massive Online Analysis framework. The data stream used for the analysis is produced by the RANDOMRBF generator. The stream sizes selected are 5000, 10000, 15000, 20000, 25000 and 30000. The number of clusters is 5. The algorithms used in the experiments are the simple continuous outlier detection (SimpleCOD) algorithm and the micro-cluster-based continuous outlier detection (MCOD) algorithm. The statistics are tabulated in Table 1 and Table 2. Other results, shown in the visualization window in Fig. 2 and Fig. 3, are self-explanatory.

Fig. 1 Configuration of Outlier Mining in MOA framework

Fig. 2 Results of Outlier Mining in MOA framework

Fig. 3 Results of Outlier Mining in MOA framework

Table 1. Statistics of SimpleCOD algorithm
No. of instances   Inlier nodes   Outlier nodes   Both inlier & outlier   MMU (MB)   TPT (ms)
5000               4331           386             283                     107        36.16
10000              8972           719             309                     133        78.89
15000              13558          1105            337                     191        127.52
20000              18148          1503            349                     158        169.83
25000              22711          1929            360                     194        243.13
30000              27315          2315            370                     221        14.00

Fig. 3 Graph of Evaluation Measures Vs Instance Size for SimpleCOD algorithm

Table 2. Statistics of MCOD algorithm
No. of instances   Inlier nodes   Outlier nodes   Both inlier & outlier   MMU (MB)   TPT (ms)
5000               4331           386             283                     107        2.14
10000              8972           719             309                     133        3.77
15000              13558          1105            337                     191        6.3
20000              18148          1503            349                     158        8.65
25000              22711          1929            360                     194        10.69
30000              27315          2315            370                     221        316.82

Fig. 4 Graph of Evaluation Measures Vs Instance Size for MCOD algorithm

V. CONCLUSION
The experiments are conducted in the Massive Online Analysis framework. The data stream used for the analysis is produced by the RANDOMRBF generator.
The stream sizes selected are 5000, 10000, 15000, 20000, 25000 and 30000. The number of clusters is 5. The algorithms used in the experiments are the simple continuous outlier detection (SimpleCOD) algorithm and the micro-cluster-based continuous outlier detection (MCOD) algorithm. The results tabulated in Table 1 and Table 2 show that the memory management units (MMU) values are the same for both algorithms at every instance size. A drastic variation is observed in the total processing time (TPT): MCOD takes less time to execute than SimpleCOD, except at the instance size of 30000. The counts of inlier and outlier nodes are the same for both algorithms. Finally, the present work establishes that, apart from traditional data mining techniques, outlier mining is also possible in data streams under the framework of massive online analysis.

REFERENCES
[1] Han, J. and Kamber, M. (ed.), "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2007, San Francisco, CA.
[2] Hawkins, Identification of Outliers, Chapman and Hall, 1980.
[3] Aggarwal, C.C. (Ed.), "Data streams: Models and Algorithms," Series: Advances in Database Systems, Vol. 31, XVIII, 354 p, 2007, ebook, Springer, Berlin Heidelberg.
[4] Guha, S., Koudas, N.K. and Shim, K., "Data Streams and Histograms," Proceedings of thirty-third annual ACM Symposium on Theory of Computing, 2003, pp. 471-475, ACM Press.
[5] Domingos, P. and Hulten, G., "Mining time-changing data streams," In KDD'00, Proceedings of the sixth ACM SIGKDD International conference on Knowledge discovery and data mining, pp. 71-80, 2000, NY, USA, doi:10.1145/347090.347107, ACM Press.
[6] Ramaswamy Sridhar, Rastogi Rajeev, Shim Kyuseok, "Efficient algorithms for mining outliers from large data sets," Proceedings of the 2000 ACM SIGMOD international conference on Management of data, New York, NY, USA, pp. 427-438, 2000.
[7] F. Cao, M. Ester, W. Qian, and A. Zhou, "Density-based clustering over an evolving data stream with noise," In SDM, 2006.
[8] F. Angiulli and F. Fassetti, "Distance-based outlier queries in data streams: the novel task and algorithms," Data Mining and Knowledge Discovery, 20(2):290-324, 2010.
[9] E. Knorr and R. Ng, "Algorithms for mining distance-based outliers in large data sets," In VLDB, 1998.
[10] M. Kontaki, A. Gounaris, A. N. Papadopoulos, K. Tsichlas, and Y. Manolopoulos, "Continuous monitoring of distance-based outliers over data streams," In ICDE, pages 135-146, 2011.
[11] Dimitrios Georgiadis, Maria Kontaki, Anastasios Gounaris, Apostolos Papadopoulos and Kostas Tsichlas, "Continuous Outlier Detection in Data Streams: An Extensible Framework and State-of-the-Art Algorithms."
[12] B. Z. J. L. Naoki Abe, "Outlier Detection by Active Learning," SIGKDD, 2006.
[13] V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, pp. 1-58, July 2009.
[14] V. Hodge and J. Austin, "A Survey of Outlier Detection Methodologies," Artificial Intelligence Review, vol. 22, pp. 85-126, October 2004.
[15] V. Barnett and T. Lewis, Outliers in Statistical Data, New York: John Wiley & Sons, Inc., 1994.
[16] Bifet, A., Frank, E., Holmes, G., Pfahringer, B., "Accurate Ensembles for Data Streams Combining Restricted Hoeffding Trees Using Stacking," Proc. 2nd Asian Conference on Machine Learning, Tokyo, Journal of Machine Learning Research, pp. 225-240, 2010.
[17] Bifet, A., Kirkby, R., Kranen, P. and Reutemann, P., "Massive Online Analysis," Technical Manual, University of Waikato, Hamilton, New Zealand, 2013.
[18] Bifet, A. and Kirkby, R., "Data stream mining: A Practical Approach," Technical report, The University of Waikato, Hamilton, New Zealand.
[19] Bifet, A., Frank, E., Holmes, G., Pfahringer, B., "MOA: Massive Online Analysis," Journal of Machine Learning Research, pp. 1601-1604, 2011.
[20] Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R. and Gavaldà, R., "New ensemble methods for evolving data streams," Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 139-148, 2009, ACM.
[21] Bifet, A. and Gavaldà, R., "Adaptive learning from evolving data streams," Advances in Intelligent Data Analysis VIII, pp. 249-260, 2009, Springer, Berlin Heidelberg.