Download Algorithm Design and Comparative Analysis for Outlier

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Human genetic clustering wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Cluster analysis wikipedia , lookup

Nearest-neighbor chain algorithm wikipedia , lookup

K-means clustering wikipedia , lookup

Transcript
ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
IJCST Vol. 6, Issue 3, July - Sept 2015
Algorithm Design and Comparative Analysis for Outlier
Detection using Genetic Algorithm and Bacterial Foraging
1
Dr. Amit Verma, 2Er.Parminder Kaur, 3Sharnjeet Kaur
1
Professor and Head, Dept. of CSE, UIE, CU
Assistant Professor, Dept. of CSE, UIE, CU
3
Research Scholar, Dept. of CSE, UIE, CU
2
Abstract
Data mining derive its name for searching necessary information
from a large database and utilizes this information in better way.
To arrange the data in proper manner there is a need of clustering
and classification techniques. With respect to other algorithms
K-Mean clustering algorithm attains attention to arrange the data
in clusters . The optimization algorithms such as Genetic algorithm
and Bacterial Foraging Algorithm used to optimize the result of
K-Mean clustering algorithm and detect outliers from the result
of K-Mean Clustering algorithm. In this paper a hybrid genetic
and bacterial foraging algorithm is proposed to find outliers from
the data
Keywords
K-mean, Genetic algorithm, Outlier Detection, Bacterial Foraging
Algorithm.
I. Introduction
Dividing objects in meaningful groups of objects or classes based on
common characteristics helps to describe prosperities and analyze
these properties to describe new patterns. We can say clusters are
classes, and cluster analysis is a process of finding these classes to
arrange the data properly. In descriptive task data is independent
due to which it is needs to be post-processed. Descriptive task
comes under unsupervised learning in which data can be classified
without using class labels. Clustering comes under unsupervised
learning as classification is supervised learning. The data in one
group is similar and different from the data that is other group.
Clustering helps to manage the information according to their
properties in clusters. The size of the cluster depends on the elements
present in one cluster. K-Mean clustering is used to cluster the
elements in proper way. The arrangement of elements in K-Mean
clustering depends upon the centriod. K-mean clustering algorithm
are attains popularity due to their high performance. K-Mean
algorithm can work on large datasets. There is one disadvantage
of K-mean algorithm that it is sensitive to noise. To cover this
problem Genetic algorithm is used with K-Mean algorithm to
detect outlier to improve the quality of data. Genetic algorithm
is search based algorithm that is inspired from human activities.
Genetic algorithm is used for large datasets where is more number
of solution. At initial stage the performance of genetic algorithm is
poor because solutions are randomly generated solutions. At end
of process the genetic algorithm gives the possible set of solution.
Crossover, mutation, fitness function are the main components of
genetic algorithm according to which it works. Genetic algorithm
generates the solution with the help of chromosomes. The new
solution is generated from parents with the help of crossover
operator. The genetic information is interchanged in parent that
leads to child solution. Mutation operator is used for variations.
After optimizing of data genetic algorithm also removes outliers.
The data that is detected by genetic algorithm as outliers also
contain information. To filter this information the hybrid genetic
and bacterial foraging algorithm is proposed.
48
International Journal of Computer Science And Technology
II. Related Work
D.M. Hawkins, “Identification of Outliers” (1980) used genetic
algorithm to detect known multiple outliers from the regression
dataset. Experiment shows that the genetic algorithm was
successful in finding the outliers from multiple dataset.
Ujjwal Maulik et.al “Semi_Automated Feature Extraction using
Simulated Annealing (2000) talked about genetic algorithm
for pattern reorganization. Genetic algorithm is a search based
algorithm provides the optimized value. Initially it selects the
randomly from the population to reach at best and optimized
result.
Aggarwal C. et al. (2001) talked about a brand new means for
outlier detection that’s perfect for quite high dimensional data sets.
Using this method detects reduced dimensional projections, which
usually are not simply discovered by means of brute force. The
offered process has advantages over basic distance based outliers.
Simple distance based outlier detection process cannot overcome
as well as reduce the result of the dimensionality curse.
Williams G et.al (2002) used RNN method to detect outliers.
Experimental results are calculated on both smaller dataset
and larger data set. Basically larger data sets are used to check
scalability and used to provide practical apps.
Bakar Z. et al (2006) talk about the functionality of control chart
with the help of different outlier detection techniques. This paper
discusses that which outlier detection technique for making control
charts. Comparison is also made with linear regression technique
for making control charts.
Elahi, M et.al (2008) a clustering based strategy, which usually
partition the flow throughout sections and also cluster every single
portion making use of k-mean throughout predetermined quantity
of clusters is launched. Many experiments in distinct dataset
concur that his or her process can discover far better outliers
together with lower computational charge as opposed to additional
getting out distance based methods involving outlier detection
throughout data stream.
N. Bansal et al. (2010) defined that clustering is most effective
technique for outlier detection. In this many algorithms are
used to detect the outlier like PAM, CLARA, CLARANS, and
ECLARANS and compared the result of these in terms of time
complexity and proposed a new solution by adding fuzziness.
ECLARANS technique takes less time amongst them all.
Alckmin D. et al. (2012) talked about a hybrid genetic k-means
algorithm (HGKA). HGKA includes some great benefits of FGKA
and also IGKA and also functions pretty much as soon as value
involving mutation likelihood is actually tiny as well as big. This
particular GA based clustering approaches making use of k-means
clustering and it can easily just manage numeric data sets.
Pachgade S. et al. (2012) talked about a hybrid way of find the
outlier really powerful way. The combination of cluster based and
also distance based strategy is needed. Bunch based strategy, first
of all splits the dataset making use of k-mean then the distance
based strategy is needed to uncover the outlier in between these
individuals if any. Recommended process usually takes much less
w w w. i j c s t. c o m
IJCST Vol. 6, Issue 3, July - Sept 2015
ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
computational charge and also functions pretty much as compared
to which from the distance based process. Using this method will
save large amount of further information.
Hsuan-Ming Feng, Ji-Hwei Horng (2012) describes the bacterial
foraging optimization algorithm with particle swarm optimization
algorithm. This technique is used for better image compression
which is based on vector quantization. This paper also defines that
the bacterial foraging particle swarm optimization technique is
evolutionary learning technique that helps to manage codebook
problems. The combination of both the BFO and PSO also gives
the fast and reliable results in less time. The quality of image
is also improved by this technique.Computer simulation results
of nonlinear image compression applications demonstrate the
efficiency of the BFPSO learning algorithm. The differences
between the proposed BFPSO learning scheme and the BFOand LBG-based VQ learning methods demonstrate the superior
image results produced by the proposed algorithm
N.Lekhi et al. (2014) described that outlier detection on data
by making the clusters of data. Outlier can be reduced if we
improve the clustering. So in this Hybrid Algorithm is proposed
that is combination of clustering and classification techniques i.e.
weighted k-mean and neural networks and this algorithm work
on the numeric and text data. By this way cluster is improved and
outlier be reduced and also increase the classification accuracy.
III. Clustering Algorithm
1. Data is arranged in different ways by using different clustering
techniques. Clustering techniques are of many types such as:
A. Hierarchical Clustering
Hierarchical clustering is basically a decomposition technique.
This algorithm divides or combines hierarchical structure which is
also known as dendrogram. It is further divided into two types
1. Agglomerative Clustering-In this type of clustering the
algorithm proceeds from bottom to top. It combines the two
clusters with minimum distance.
2. Divisive clustering-In this type of clustering the algorithm
proceed from top to bottom. A large cluster is divided into
sub clusters.
B. Density Based Clustering
It is based on density of objects. Density based clustering is used
to managed irregular shaped objects.
C. Grid Based Clustering
Grid based clustering algorithm are used for spatial data. It
includes structure of objects in space. The data is clustered into
cells. Various algorithms under grid based clustering are:-
algorithm is based on medoids.
2. K-Mean Algorithm
This algorithm is based on the mean value of the objects in the
cluster.n
IV. Genetic Algorithm
Genetic algorithm (GA) is a stochastic seek strategy that will
mimics the actual healthy advancement offered by simply Charles
Darwin throughout 1858. GA retains the individual connected with
n chromosomes (solutions) with connected fit- ness valuations.
The full course of action is actually identified as uses: moms
and dads are selected to companion, on the basis of their fitness,
making young by using a reproductive system approach. As a
result very healthy alternatives are shown a lot more prospects
to duplicate, making sure that young inherit attributes from each
and every parent or guardian. As parents mate and pro- duce
offspring, room must be made for [12] the new arrivals since the
population is kept at a static size. Individuals in the population die
and then, are replaced by the new solutions, eventually creating a
new generation once all mating opportunities in the old population
have been exhausted. In this manner, it really is estimated that
will in excess of effective many years, much better alternatives
will thrive as the lowest healthy alternatives pass away available.
Right after an initial population is actually randomly created, the
actual algorithm advances by means of a few operators:
1. Selection
Gives preference to better individuals, letting them offer their
genetics to another technology. This amazing benefits of every
specific is dependent upon the fitness. Fitness may be driven by
an objective function or perhaps by way of a very subjective
judgment.
2. Crossover
Represents mating concerning individuals; a couple individual
are chosen in the population using the selection operator. Any
crossover site over the bit strings is actually randomly picked.
The significance of every string is actually exchanged up to
this point. Each new young offspring made from this specific
mating are placed in your next generation from the population.
By recombining portions connected with great individuals, this
method probably will develop better individuals.
1. Clique Algorithm
It is a combination of both grid based and density based clustering.
It is used for clustering high dimensional data in large database.
3. Mutation
Features random alterations. The purpose of mutation is always
to keep diversity within the population and prevent uncontrolled
convergence. With a number of low chance, a smallportion of the
new individuals will have some of their bits dipped. Mutation as
well as assortment (without crossover) develop parallel, noisetolerant, hill-climbing algorithms. This pseudo rule connected
with genetic algorithms is actually found beneath [17].
D. Partitioning Algorithm
This method is based on partition of data. In this type of clustering
the data is partitioned into groups or clusters. The objects in one
cluster are similar with each other but dissimilar from the objects
in other cluster.
V. Bacterial Foraging Algorithm
BFO Algorithm is an optimization algorithm based on the behavior
of E.coli bacteria in the body. It is basically used for solving
optimization problems. Basically it involves two stages run and
tumble that help bacteria looking for food to survive ever.
1. K-Medoids Algorithm
In this algorithm basically every cluster is represented by one
of the object which located near the center of the cluster. This
w w w. i j c s t. c o m
International Journal of Computer Science And Technology 49
IJCST Vol. 6, Issue 3, July - Sept 2015
ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
VI. Proposed Algorithm
the outlier data that is detected by K-Mean genetic algorithm.
Hybrid genetic and bacterial foraging algorithm also detects proper
outliers that degrade the performance of system.
In future we can use more NIS Algorithm to detect outlier to
achieve more accurate result. In medical study this work is
extended to detect genes belong to same category by detecting
those genes which does belong to them. In networking instead of
K-Mean algorithm dikstra’s algorithm is used to check the correct
nodes connected with each other.
Algorithm Genetic_BFO
Begin
For dataset Di
{
Upload the complete components
}
For (Di =1, Di< Df , Di++)
{
Extract Feature set Fi
where Fi € Di
{
Do initialize clusters using k-mean algorithm.
}
}
For each (Fi =1, Fi< Ff, Fi++)
{
Apply genetic
Load genetic configurations
}
For Fg
{
Evaluate fitness function
For each (Fg=C, Fg<TValus function< Fg--)
{
If (Data processing = TRUE)
{
Arrange the elements into clusters
and Save the outlier data
}
Upload outlier data in bacterial foraging
optimization algorithm.
If( condition set by bacterial foraging is
satisfied by upload data =True)
{
Classified data
}
Else
save data as outlier
End
VII. Conclusion
Clustering is an unsupervised technique that handles large data
in efficient way into different clusters. Amongst the various
clustering techniques K-Mean clustering Algorithm is a portioning
algorithm with high performance. Outliers are the raw data that
are present in data in the form of numeric values, tags, link, and
addresses. Detection of outlier is important because they degrade
the quality and performance of data. In this research work the
hybrid Genetic and Bacterial foraging optimization algorithm are
used to detect the outliers from data. Genetic algorithm provides
the optimized result that is further processed by bacterial foraging
algorithm to detect raw data. This research work also presents the
comparison between K-Mean genetic algorithm and hybrid genetic
and bacterial foraging algorithm. This research works shows
that K-Mean genetic algorithm detects important information as
outliers. While hybrid genetic algorithm and bacterial foraging
algorithm removes that problem it filters the important data from
50
International Journal of Computer Science And Technology
References
[1] Aggarwal C., Philip S. yu,“Outlier Detection for High
Dimensional Data”, ACM SIGMOD Record, Vol. 30, Issue
2, June 2001, pp. 37 – 46.
[2] Williams, G.; Baxter, R. Hongxing He. Hawkins, S. Lifang
Gu,“A comparative study of RNN for outlier detection in
data mining”, Data Mining, 2002. ICDM 2003. Proceedings.
2002 IEEE International Conference on 2002,
[4] Bakar Z,“A Comparative Study for Outlier Detection
Techniques in Data Mining”, 2006 IEEE.
[5] Ke Wang, Philip S. Yu,“Bottom-Up Generalization: A Data
Mining Solution to Privacy Protection”, Proceedings of
the Fourth IEEE International Conference on Data Mining
(ICDM’04), IEEE.
[6] Mantas Paulinas, Andrius Ušinskas,“A Survey of Genetic
Algorithms Applications for Image Enhancement and
Segmentation”, 2007, Vol. 36, No. 3.
[7] Aggarwal, Ed,"Data Streams–Models and Algorithms”,
Springer, 2007.
[8] B.K. Panigrahi, V. Ravikumar Pandi,“Bacterial foraging
optimisation: Nelder–Mead hybrid algorithm for economic
load dispatch”, IET Gener. Transm. Distrib, Vol. 2, No. 4,
pp. 556-565, 2008.
[9] Elahi, M.; Kun Li; Nisar, W.; Xinjie Lv; Hongan Wang,
“Efficient Clustering-Based Outlier Detection Algorithm
for Dynamic Data Stream”, Fuzzy Systems and Knowledge
Discovery, 2008. FSKD ‘08. Fifth International Conference
on (Vol. 5), pp. 18-20 Oct. 2008.
[10]Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing
Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David
J. Hand, Dan Steinberg,“Top 10 algorithms in data mining”,
knwl inf syst, Vol. 14, pp. 1-37, 2008.
[11] H. Shen, Y. L. Zhu, X. M. Zhou, H. F. Guo, C. G. Chang,
“Bacterial foraging optimization algorithm with particle
swarm optimization strategy for global numerical
optimization”, World Summit on Genetic and Evolutionary
Computation, Shanghai, China, pp. 497-504, June 2009.
[12]Chen Han-ning, Zhu Yun-long, Hu Kun-yuan,“Cooperative
bacterial foraging algorithm for global optimization”, Chinese
control and decision conference, Guilin, China, 2009, 6: pp.
3896-3901.
[13]N.Bansa,“Differentiate Clustering Approaches for Outlier
Detection”, International Journal of Innovative Research in
Computer and Communication Engineering, pp. 193-196,
Vol. 1, 2010.
[14]Velmurugan T., Santhanam T.,"Performance Evaluation of
K-Means and Fuzzy C-Means Clustering Algorithms for
Statistical Distributions of Input Data Points”, European
Journal of Scientific Research, Vol. 46, No. 3, pp. 320-330,
2010.
[15]Salwa K. Abd-El-Hafiz.,“A Metrics-Based Data Mining
Approach for Software Clone Detection”, IEEE 36th
w w w. i j c s t. c o m
ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
IJCST Vol. 6, Issue 3, July - Sept 2015
International Conference on Computer Software and
Application, 2012.
[16]Alckmin D. et al,“Hybrid Genetic Algorithm Applied To the
Clustering Problem”, Revista Investigacion Operacional.
Vol. 33, No. 2, pp. 141-151, 2012.
[17]S.D Pachgade, S.S Dhande,“Outlier Detection over Data
set Using Cluster Based and Distance Based Approach”,
International Journal of Advance Research 64.
w w w. i j c s t. c o m
International Journal of Computer Science And Technology 51