Malaya J. Mat. S(2)(2015) 361–365

An Efficient Outlier Detection Using Amalgamation of Clustering and Attribute-Entropy Based Approach

Jasmine Natchial. F (a,∗), Elango Parasuraman (b) and Taslina. B (c)

(a) PhD. Scholar, Bharathiar University, Coimbatore, India.
(b) Assistant Professor, Department of IT, PKIET, Karaikal, India.
(c) M.E Scholar, Dept of CSE, BCET, Karaikal, India.

Abstract

Many organizations today maintain very large databases that grow without limit, at a rate of several million records per day. Mining these continuous data streams brings unique opportunities but also new challenges, and data streams have therefore become a dynamic research area of data mining. Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. It is motivated by emerging applications such as consumer click streams, telephone records, bulky sets of web pages, multimedia data and so on. Outlier detection in streaming data is a very challenging problem because data streams cannot be scanned multiple times. Outliers may exert undue influence on the results of statistical analysis, so they should be identified using reliable detection methods before data analysis is performed. The main objective of the proposed research work is to detect outliers using an amalgamation technique, in which two or more techniques are combined for efficient anomaly detection.

Keywords: Outlier, Amalgamation, Cluster-based, Attribute-Similarity based.

© 2012 MJM. All rights reserved.

2010 MSC: 34G20.

1 Introduction

A data stream is an enormous sequence of data elements continuously generated at a fast rate. In a data stream, huge amounts of data are continuously inserted and queried, so the underlying database becomes very large. Data streams are motivated by emerging applications with massive data sets. In recent years, enormous research activity has been motivated by the explosion of data collected and transferred in the form of data streams.
As a result, data stream mining has gained importance and has become an extensively studied research area.

An outlier is an object, a point, or a set of data in an observation that is dissimilar to, inconsistent with, or does not comply with the general behavior or model of the remaining data. Depending on the application domain, these abnormal patterns are referred to as outliers, anomalies, discordant observations, faults, exceptions, defects, aberrations, errors, noise, damage, surprise, novelty, peculiarities or impurities. Attempts to eliminate them altogether result in the loss of important hidden information, as one person's noise could be another person's signal. In many cases outliers are more interesting than normal cases, for example in network intrusion detection, credit card fraud detection, weather prediction, detection of outlying cases in medical data, marketing and customer segmentation.

Outliers may be erroneous or real in the following sense. Real outliers are observations whose actual values are very different from those observed for the rest of the data and violate plausible relationships among variables. Erroneous outliers are observations that are distorted due to misreporting errors in the data collection process. Outliers are produced by the following three causes: (i) data caused by inherent changes, (ii) erroneous data resulting from execution errors such as manual entry errors, hacker break-ins and equipment failure, (iii) data that fall into the wrong classes.

∗ Corresponding author.

2 Related Study

Over the years, a large number of techniques have been developed for building models for outlier and anomaly detection. However, real-world data sets and data streams present a range of difficulties that limit the effectiveness of these techniques.
The classical techniques of outlier detection can be categorized as follows: statistic-based methods [5], depth-based methods [16], distance-based methods [6, 7], clustering-based methods [10], deviation-based methods [11, 12], density-based methods [3, 8, 9], and dissimilarity-based [13] or similarity-based [4] methods.

The statistical outlier detection techniques are essentially model-based techniques and are suited to quantitative real-valued data sets or ordinal data distributions. A data instance is declared an outlier if the probability of the instance being generated by the model is very low. These techniques are based on statistical estimates of unknown distribution parameters [14, 15], and therein lies their limitation.

In depth-based methods, data objects are organized in convex hull layers in the data space according to peeling depth, and outliers are expected to have shallow depth values. As the dimensionality increases, the data points are spread through a larger volume and become less dense. This makes the convex hull harder to discern and is known as the “Curse of Dimensionality”.

The distance-based methods rely on the measure of full-dimensional distance between a point and its nearest neighbor in the data set. The need for careful selection of suitable parameters is the major drawback of this method.

Cluster analysis is a popular unsupervised technique for grouping similar data instances into clusters. It involves a clustering step that partitions the data into groups of similar objects. Here the outliers either belong to very small clusters, do not belong to any cluster, or are forced to belong to a cluster whose other members are very different from them. A major limitation of this approach is that it requires multiple passes over the data set.

The deviation-based methods identify outliers by examining the main characteristics of objects in a group instead of applying statistical tests or distance-based measurements.
Objects that deviate from the given description are considered outliers. This method can perform very well, but its hypothesis of what constitutes an exception is too idealized.

Density-based detection estimates the density distribution around a data point within the data set and compares the density around the point with the density around its local neighbors. The relative density of a point compared to its neighbors is computed as an outlier score, and points with a low density are considered outliers.

In the dissimilarity-based method, an outlier is detected on the basis of the percentage by which a data object varies from the others, that is, objects that are very dissimilar to the other data objects in the data set. In the similarity-based approach, outlier detection focuses on finding the similarity coefficient and the object deviation degree.

The classical methods above each have their advantages in application, but they all have limitations in certain aspects. To address these limitations, the amalgamation of techniques for outlier detection has been proposed and has gained attention in recent years. These hybrid approaches combine or merge two or more techniques for efficient anomaly detection. This paper suggests the amalgamation of a clustering-based method with the similarity-based approach.

The rest of this paper is organized as follows. Section 3 explains the amalgamation procedure, Section 4 presents the proposed algorithm, and Section 5 gives the conclusion.

3 Formal Definition

This paper proposes a two-phase method to detect outliers.

(A) The first phase groups the data into clusters using the Euclidean distance.
(B) The second phase constructs a similarity matrix and finds the Object Deviation Degree.

All the attributes are taken into account in the Object Deviation Degree: the larger the value, the greater the possibility that the object is an outlier, and vice versa.
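The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. Phase A is an ordinary K-Means loop with Euclidean distance. The paper does not give the formulas for the attribute similarity coefficient or the attribute entropy, so Phase B substitutes a plainly different stand-in score: the mean, over attributes, of each object's absolute deviation from the cluster mean, scaled by the attribute's range inside the cluster. The function names (`kmeans`, `deviation_degrees`, `detect_outliers`), the sample points and the threshold 0.6 are all hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centres, max_iter=100):
    """Phase A: assign each point to its nearest centre, then recompute centres."""
    centres = [list(c) for c in centres]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centres)), key=lambda j: euclidean(p, centres[j]))
            for p in points
        ]
        new_centres = []
        for j, old in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(old)  # an empty cluster keeps its old centre
        if new_centres == centres:  # centres stopped moving: converged
            break
        centres = new_centres
    return labels, centres


def deviation_degrees(members):
    """Phase B stand-in (NOT the paper's attribute-entropy formula).

    Score = mean over attributes of |value - cluster mean| / in-cluster range.
    """
    cols = list(zip(*members))
    means = [sum(c) / len(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero
    return [
        sum(abs(v - m) / s for v, m, s in zip(p, means, spans)) / len(cols)
        for p in members
    ]


def detect_outliers(points, centres, threshold):
    """Cluster the points, then flag objects whose score exceeds the threshold."""
    labels, centres = kmeans(points, centres)
    outliers = []
    for j in range(len(centres)):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            continue
        for p, score in zip(members, deviation_degrees(members)):
            if score > threshold:
                outliers.append(p)
    return outliers


# Hypothetical sample data: two tight groups plus one stray point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (3.0, 3.0),
          (10.0, 10.0), (10.1, 9.9), (9.9, 10.1), (10.0, 10.2)]
initial_centres = [(1.0, 1.0), (10.0, 10.0)]
print(detect_outliers(points, initial_centres, threshold=0.6))
```

Because the stand-in score is relative to each cluster's own spread, even a tight cluster contains objects with moderately high scores, so the threshold has to be tuned for the data at hand. The same concern would apply to any pre-set threshold on a per-cluster deviation degree.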
3.1 Clustering Algorithm

A prototype-based, simple partition clustering technique called K-Means clustering is used here. This algorithm attempts to find a user-specified number k of clusters, each represented by its centroid. A cluster centroid is typically the mean of the points in the cluster. There are two stages in this algorithm:

First stage: Selection of k centres at random, where k is fixed in advance.
Second stage: Assignment of data objects to the nearest centre.

Euclidean distance is used to determine the distance between each data object and the cluster centres. Once every data object has been assigned to a cluster, each cluster centre is recalculated as the average of its members. This iterative process repeats until the criterion function reaches its minimum.

3.2 Similarity-Matrix Computation Algorithm

Given the input data set belonging to a particular cluster, the Attribute Similarity Coefficient and the Attribute Entropy are calculated first. The next step is to construct the Attribute Entropy Matrix. Finally, with the help of this matrix, the object entropy can be calculated, from which the Maximal Attribute Entropy and the Object Deviation Degree can be deduced. The larger the deviation degree, the greater the possibility that the object is an outlier.

4 Proposed Algorithm

Input: Data set $D = (U, A)$, where $U$ stands for the object set, $U = \{u_i \mid i \in L\}$, $L = \{1, 2, \ldots, m\}$, and $A$ stands for the attribute set, $A = \{a_j \mid j \in S\}$, $S = \{1, 2, \ldots, n\}$. Cluster centres $C = \{c_1, c_2, \ldots, c_k\}$, where each $c_j$ is a cluster centre and $k$ is the number of clusters.

Output: The outliers.

Step 1: Compute the distance between each data point $u_i$ and each cluster centre $c_j$.
Step 2: Assign each data object $u_i$ to the cluster with the closest centroid $c_j$.
Step 3: For each data object $u_i$, compute its distance from the centroid $c_j$ of the nearest cluster.
Step 4: If the calculated distance is less than or equal to the previously calculated distance, the data object stays in its current cluster; otherwise, calculate the distance to each of the new cluster centres and assign the data object to the nearest cluster.
Step 5: Repeat the above two steps until no new centroids are found or a convergence criterion is met.
Step 7: Consider each cluster as a separate data set and compute the similarity coefficient of each data point of the data set.
Step 8: Compute the attribute entropy and construct the attribute entropy matrix, from which the object entropy can be deduced.
Step 9: Compute the object deviation degree and compare it with a pre-set threshold.
Step 10: The larger the object deviation degree, the greater the possibility that the data point is an outlier.

Thus, outliers can be efficiently identified within each cluster.

5 Conclusion

This research was pursued by taking into account the different conditions found in practice. The key objective of the proposed work is to merge the clustering-based outlier detection method for streaming data with the attribute-entropy based approach. The method applies both the clustering method and the attribute entropy method to detect group and individual outliers. The proposed method still needs to be implemented on varying data sets; it appears well suited for outlier detection in data stream mining. Future work requires an approach applicable to categorical and mixed data sets. To achieve accurate results and faster computation, the method may have overlooked some refinements that would have made the model more sophisticated but would also have increased computational difficulty.

References

[1] S. Vijayarani, P. Jothi, “An Efficient Clustering Algorithm for Outlier Detection in Data Streams,” International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, Issue 9, September 2013.
[2] N.V. Kadam, A. G. Kadu, M. R. Bhongade, A Modern Way to Detect Outliers by Merging Two Approaches i.e. Cluster Based and Distance Based.
[3] Ramaswamy S, Rastogi R, Shim K, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. of the ACM SIGMOD International Conf. on Management of Data, Texas: ACM Press, 2000, pp.473–478.
[4] Ming-jian Zhou, Jun-cai Tao, An Outlier Mining Algorithm Based on Attribute Entropy, Procedia Environmental Sciences 11 (2011) 132–138.
[5] Barnett V, Lewis T, “Outliers in Statistical Data,” New York: John Wiley & Sons, 1994.
[6] Bay S, Schwabacher M, “Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule,” Proc. of ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining. Washington, 2003, pp.29–38.
[7] Knorr E, Ng R, “Finding Intensional Knowledge of Distance-Based Outliers,” Proc. of the VLDB Conf. Edinburgh: Morgan Kaufmann Publishers, 1999, pp.211–222.
[8] Mao Junguo, Duan lijuan, Wang Shi, Shi Yun, “Data Mining Principle and Algorithm,” Tsinghua University Press, 2007.
[9] Breunig M, Kriegel H, Ng R, Sander J, “LOF: Identifying Density-Based Local Outliers,” Dallas, Texas: Proc. of ACM SIGMOD Conf., 2000, pp.94–104.
[10] Han Jiawei, Kamber M, “Data Mining: Concepts and Techniques,” New York: Morgan Kaufmann Publishers, 2001.
[11] Yu Zhongqing, Fang Yi, Pan Zhenkuan, Shao Fengjing, “OLPA Architecture,” Qingdao-Hong Kong International Computer Conf., 1999, 10.
[12] Arning A, Agrawal R, Raghavan P, “A Linear Method for Deviation Detection in Large Databases,” Proc. of the KDD Conf., Portland: AAAI Press, 1996, pp.164–169.
[13] Ming-jian Zhou, Xue-jiao Chen, An Outlier Mining Algorithm Based on Dissimilarity, Procedia Environmental Sciences 12 (2012) 810–814, 2011 International Conference on Environmental Science and Engineering.
[14] Hawkins D., Identification of Outliers, Chapman and Hall, 1980.
[15] Barnett V., Lewis T., Outliers in Statistical Data, John Wiley, 1994.
[16] Abishek B, Manker, Namrata Ghuse, A Review on Detection of Outliers Over High Dimensional Streaming Data Using Cluster Based Hybrid Approach, International Journal of Science and Research.
[17] Behera H.S., Abishek Ghosh, Sipak ku. Mishra, A New Hybridised K-Means Clustering Based Outlier Detection Technique For Effective Data Mining.
[18] Murugavel P., Punithavalli M., Improved Hybrid Clustering and Distance-based Technique for Outlier Removal, International Journal on Computer Science and Engineering.
[19] Surekha V Peshatwar, Snehlata Dongre, Outlier Detection Over Data Stream Using Cluster Based Approach And Distance Based Approach, International Conference On Electrical Engineering and Computer Science.
[20] Garima Singh, Vijay Kumar, An Efficient Clustering and Distance Based Approach for Outlier Detection, International Journal of Computer Trends and Technology, Vol.4, Iss.7, July 2013.

Received: August 18, 2015; Accepted: September 10, 2015

UNIVERSITY PRESS
Website: http://www.malayajournal.org/