Download Algorithm Design and Comparative Analysis for Outlier

ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print) IJCST Vol. 6, Issue 3, July - Sept 2015 Algorithm Design and Comparative Analysis for Outlier Detection using Genetic Algorithm and Bacterial Foraging 1 Dr. Amit Verma, 2Er.Parminder Kaur, 3Sharnjeet Kaur 1 Professor and Head, Dept. of CSE, UIE, CU Assistant Professor, Dept. of CSE, UIE, CU 3 Research Scholar, Dept. of CSE, UIE, CU 2 Abstract Data mining derive its name for searching necessary information from a large database and utilizes this information in better way. To arrange the data in proper manner there is a need of clustering and classification techniques. With respect to other algorithms K-Mean clustering algorithm attains attention to arrange the data in clusters . The optimization algorithms such as Genetic algorithm and Bacterial Foraging Algorithm used to optimize the result of K-Mean clustering algorithm and detect outliers from the result of K-Mean Clustering algorithm. In this paper a hybrid genetic and bacterial foraging algorithm is proposed to find outliers from the data Keywords K-mean, Genetic algorithm, Outlier Detection, Bacterial Foraging Algorithm. I. Introduction Dividing objects in meaningful groups of objects or classes based on common characteristics helps to describe prosperities and analyze these properties to describe new patterns. We can say clusters are classes, and cluster analysis is a process of finding these classes to arrange the data properly. In descriptive task data is independent due to which it is needs to be post-processed. Descriptive task comes under unsupervised learning in which data can be classified without using class labels. Clustering comes under unsupervised learning as classification is supervised learning. The data in one group is similar and different from the data that is other group. Clustering helps to manage the information according to their properties in clusters. The size of the cluster depends on the elements present in one cluster. K-Mean clustering is used to cluster the elements in proper way. The arrangement of elements in K-Mean clustering depends upon the centriod. K-mean clustering algorithm are attains popularity due to their high performance. K-Mean algorithm can work on large datasets. There is one disadvantage of K-mean algorithm that it is sensitive to noise. To cover this problem Genetic algorithm is used with K-Mean algorithm to detect outlier to improve the quality of data. Genetic algorithm is search based algorithm that is inspired from human activities. Genetic algorithm is used for large datasets where is more number of solution. At initial stage the performance of genetic algorithm is poor because solutions are randomly generated solutions. At end of process the genetic algorithm gives the possible set of solution. Crossover, mutation, fitness function are the main components of genetic algorithm according to which it works. Genetic algorithm generates the solution with the help of chromosomes. The new solution is generated from parents with the help of crossover operator. The genetic information is interchanged in parent that leads to child solution. Mutation operator is used for variations. After optimizing of data genetic algorithm also removes outliers. The data that is detected by genetic algorithm as outliers also contain information. To filter this information the hybrid genetic and bacterial foraging algorithm is proposed. 48 International Journal of Computer Science And Technology II. Related Work D.M. Hawkins, “Identification of Outliers” (1980) used genetic algorithm to detect known multiple outliers from the regression dataset. Experiment shows that the genetic algorithm was successful in finding the outliers from multiple dataset. Ujjwal Maulik et.al “Semi_Automated Feature Extraction using Simulated Annealing (2000) talked about genetic algorithm for pattern reorganization. Genetic algorithm is a search based algorithm provides the optimized value. Initially it selects the randomly from the population to reach at best and optimized result. Aggarwal C. et al. (2001) talked about a brand new means for outlier detection that’s perfect for quite high dimensional data sets. Using this method detects reduced dimensional projections, which usually are not simply discovered by means of brute force. The offered process has advantages over basic distance based outliers. Simple distance based outlier detection process cannot overcome as well as reduce the result of the dimensionality curse. Williams G et.al (2002) used RNN method to detect outliers. Experimental results are calculated on both smaller dataset and larger data set. Basically larger data sets are used to check scalability and used to provide practical apps. Bakar Z. et al (2006) talk about the functionality of control chart with the help of different outlier detection techniques. This paper discusses that which outlier detection technique for making control charts. Comparison is also made with linear regression technique for making control charts. Elahi, M et.al (2008) a clustering based strategy, which usually partition the flow throughout sections and also cluster every single portion making use of k-mean throughout predetermined quantity of clusters is launched. Many experiments in distinct dataset concur that his or her process can discover far better outliers together with lower computational charge as opposed to additional getting out distance based methods involving outlier detection throughout data stream. N. Bansal et al. (2010) defined that clustering is most effective technique for outlier detection. In this many algorithms are used to detect the outlier like PAM, CLARA, CLARANS, and ECLARANS and compared the result of these in terms of time complexity and proposed a new solution by adding fuzziness. ECLARANS technique takes less time amongst them all. Alckmin D. et al. (2012) talked about a hybrid genetic k-means algorithm (HGKA). HGKA includes some great benefits of FGKA and also IGKA and also functions pretty much as soon as value involving mutation likelihood is actually tiny as well as big. This particular GA based clustering approaches making use of k-means clustering and it can easily just manage numeric data sets. Pachgade S. et al. (2012) talked about a hybrid way of find the outlier really powerful way. The combination of cluster based and also distance based strategy is needed. Bunch based strategy, first of all splits the dataset making use of k-mean then the distance based strategy is needed to uncover the outlier in between these individuals if any. Recommended process usually takes much less w w w. i j c s t. c o m IJCST Vol. 6, Issue 3, July - Sept 2015 ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print) computational charge and also functions pretty much as compared to which from the distance based process. Using this method will save large amount of further information. Hsuan-Ming Feng, Ji-Hwei Horng (2012) describes the bacterial foraging optimization algorithm with particle swarm optimization algorithm. This technique is used for better image compression which is based on vector quantization. This paper also defines that the bacterial foraging particle swarm optimization technique is evolutionary learning technique that helps to manage codebook problems. The combination of both the BFO and PSO also gives the fast and reliable results in less time. The quality of image is also improved by this technique.Computer simulation results of nonlinear image compression applications demonstrate the efficiency of the BFPSO learning algorithm. The differences between the proposed BFPSO learning scheme and the BFOand LBG-based VQ learning methods demonstrate the superior image results produced by the proposed algorithm N.Lekhi et al. (2014) described that outlier detection on data by making the clusters of data. Outlier can be reduced if we improve the clustering. So in this Hybrid Algorithm is proposed that is combination of clustering and classification techniques i.e. weighted k-mean and neural networks and this algorithm work on the numeric and text data. By this way cluster is improved and outlier be reduced and also increase the classification accuracy. III. Clustering Algorithm 1. Data is arranged in different ways by using different clustering techniques. Clustering techniques are of many types such as: A. Hierarchical Clustering Hierarchical clustering is basically a decomposition technique. This algorithm divides or combines hierarchical structure which is also known as dendrogram. It is further divided into two types 1. Agglomerative Clustering-In this type of clustering the algorithm proceeds from bottom to top. It combines the two clusters with minimum distance. 2. Divisive clustering-In this type of clustering the algorithm proceed from top to bottom. A large cluster is divided into sub clusters. B. Density Based Clustering It is based on density of objects. Density based clustering is used to managed irregular shaped objects. C. Grid Based Clustering Grid based clustering algorithm are used for spatial data. It includes structure of objects in space. The data is clustered into cells. Various algorithms under grid based clustering are:- algorithm is based on medoids. 2. K-Mean Algorithm This algorithm is based on the mean value of the objects in the cluster.n IV. Genetic Algorithm Genetic algorithm (GA) is a stochastic seek strategy that will mimics the actual healthy advancement offered by simply Charles Darwin throughout 1858. GA retains the individual connected with n chromosomes (solutions) with connected fitness valuations. The full course of action is actually identified as uses: moms and dads are selected to companion, on the basis of their fitness, making young by using a reproductive system approach. As a result very healthy alternatives are shown a lot more prospects to duplicate, making sure that young inherit attributes from each and every parent or guardian. As parents mate and pro- duce offspring, room must be made for [12] the new arrivals since the population is kept at a static size. Individuals in the population die and then, are replaced by the new solutions, eventually creating a new generation once all mating opportunities in the old population have been exhausted. In this manner, it really is estimated that will in excess of effective many years, much better alternatives will thrive as the lowest healthy alternatives pass away available. Right after an initial population is actually randomly created, the actual algorithm advances by means of a few operators: 1. Selection Gives preference to better individuals, letting them offer their genetics to another technology. This amazing benefits of every specific is dependent upon the fitness. Fitness may be driven by an objective function or perhaps by way of a very subjective judgment. 2. Crossover Represents mating concerning individuals; a couple individual are chosen in the population using the selection operator. Any crossover site over the bit strings is actually randomly picked. The significance of every string is actually exchanged up to this point. Each new young offspring made from this specific mating are placed in your next generation from the population. By recombining portions connected with great individuals, this method probably will develop better individuals. 1. Clique Algorithm It is a combination of both grid based and density based clustering. It is used for clustering high dimensional data in large database. 3. Mutation Features random alterations. The purpose of mutation is always to keep diversity within the population and prevent uncontrolled convergence. With a number of low chance, a smallportion of the new individuals will have some of their bits dipped. Mutation as well as assortment (without crossover) develop parallel, noisetolerant, hill-climbing algorithms. This pseudo rule connected with genetic algorithms is actually found beneath [17]. D. Partitioning Algorithm This method is based on partition of data. In this type of clustering the data is partitioned into groups or clusters. The objects in one cluster are similar with each other but dissimilar from the objects in other cluster. V. Bacterial Foraging Algorithm BFO Algorithm is an optimization algorithm based on the behavior of E.coli bacteria in the body. It is basically used for solving optimization problems. Basically it involves two stages run and tumble that help bacteria looking for food to survive ever. 1. K-Medoids Algorithm In this algorithm basically every cluster is represented by one of the object which located near the center of the cluster. This w w w. i j c s t. c o m International Journal of Computer Science And Technology 49 IJCST Vol. 6, Issue 3, July - Sept 2015 ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print) VI. Proposed Algorithm the outlier data that is detected by K-Mean genetic algorithm. Hybrid genetic and bacterial foraging algorithm also detects proper outliers that degrade the performance of system. In future we can use more NIS Algorithm to detect outlier to achieve more accurate result. In medical study this work is extended to detect genes belong to same category by detecting those genes which does belong to them. In networking instead of K-Mean algorithm dikstra’s algorithm is used to check the correct nodes connected with each other. Algorithm Genetic_BFO Begin For dataset Di { Upload the complete components } For (Di =1, Di< Df , Di++) { Extract Feature set Fi where Fi € Di { Do initialize clusters using k-mean algorithm. } } For each (Fi =1, Fi< Ff, Fi++) { Apply genetic Load genetic configurations } For Fg { Evaluate fitness function For each (Fg=C, Fg<TValus function< Fg--) { If (Data processing = TRUE) { Arrange the elements into clusters and Save the outlier data } Upload outlier data in bacterial foraging optimization algorithm. If( condition set by bacterial foraging is satisfied by upload data =True) { Classified data } Else save data as outlier End VII. Conclusion Clustering is an unsupervised technique that handles large data in efficient way into different clusters. Amongst the various clustering techniques K-Mean clustering Algorithm is a portioning algorithm with high performance. Outliers are the raw data that are present in data in the form of numeric values, tags, link, and addresses. Detection of outlier is important because they degrade the quality and performance of data. In this research work the hybrid Genetic and Bacterial foraging optimization algorithm are used to detect the outliers from data. Genetic algorithm provides the optimized result that is further processed by bacterial foraging algorithm to detect raw data. This research work also presents the comparison between K-Mean genetic algorithm and hybrid genetic and bacterial foraging algorithm. This research works shows that K-Mean genetic algorithm detects important information as outliers. While hybrid genetic algorithm and bacterial foraging algorithm removes that problem it filters the important data from 50 International Journal of Computer Science And Technology References [1] Aggarwal C., Philip S. yu,“Outlier Detection for High Dimensional Data”, ACM SIGMOD Record, Vol. 30, Issue 2, June 2001, pp. 37 – 46. [2] Williams, G.; Baxter, R. Hongxing He. Hawkins, S. Lifang Gu,“A comparative study of RNN for outlier detection in data mining”, Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on 2002, [4] Bakar Z,“A Comparative Study for Outlier Detection Techniques in Data Mining”, 2006 IEEE. [5] Ke Wang, Philip S. Yu,“Bottom-Up Generalization: A Data Mining Solution to Privacy Protection”, Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM’04), IEEE. [6] Mantas Paulinas, Andrius Ušinskas,“A Survey of Genetic Algorithms Applications for Image Enhancement and Segmentation”, 2007, Vol. 36, No. 3. [7] Aggarwal, Ed,"Data Streams–Models and Algorithms”, Springer, 2007. [8] B.K. Panigrahi, V. Ravikumar Pandi,“Bacterial foraging optimisation: Nelder–Mead hybrid algorithm for economic load dispatch”, IET Gener. Transm. Distrib, Vol. 2, No. 4, pp. 556-565, 2008. [9] Elahi, M.; Kun Li; Nisar, W.; Xinjie Lv; Hongan Wang, “Efficient Clustering-Based Outlier Detection Algorithm for Dynamic Data Stream”, Fuzzy Systems and Knowledge Discovery, 2008. FSKD ‘08. Fifth International Conference on (Vol. 5), pp. 18-20 Oct. 2008. [10]Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg,“Top 10 algorithms in data mining”, knwl inf syst, Vol. 14, pp. 1-37, 2008. [11] H. Shen, Y. L. Zhu, X. M. Zhou, H. F. Guo, C. G. Chang, “Bacterial foraging optimization algorithm with particle swarm optimization strategy for global numerical optimization”, World Summit on Genetic and Evolutionary Computation, Shanghai, China, pp. 497-504, June 2009. [12]Chen Han-ning, Zhu Yun-long, Hu Kun-yuan,“Cooperative bacterial foraging algorithm for global optimization”, Chinese control and decision conference, Guilin, China, 2009, 6: pp. 3896-3901. [13]N.Bansa,“Differentiate Clustering Approaches for Outlier Detection”, International Journal of Innovative Research in Computer and Communication Engineering, pp. 193-196, Vol. 1, 2010. [14]Velmurugan T., Santhanam T.,"Performance Evaluation of K-Means and Fuzzy C-Means Clustering Algorithms for Statistical Distributions of Input Data Points”, European Journal of Scientific Research, Vol. 46, No. 3, pp. 320-330, 2010. [15]Salwa K. Abd-El-Hafiz.,“A Metrics-Based Data Mining Approach for Software Clone Detection”, IEEE 36th w w w. i j c s t. c o m ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print) IJCST Vol. 6, Issue 3, July - Sept 2015 International Conference on Computer Software and Application, 2012. [16]Alckmin D. et al,“Hybrid Genetic Algorithm Applied To the Clustering Problem”, Revista Investigacion Operacional. Vol. 33, No. 2, pp. 141-151, 2012. [17]S.D Pachgade, S.S Dhande,“Outlier Detection over Data set Using Cluster Based and Distance Based Approach”, International Journal of Advance Research 64. w w w. i j c s t. c o m International Journal of Computer Science And Technology 51

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Algorithm Design and Comparative Analysis for Outlier