Download IJCSI International Journal of Computer Science Issues, Vol. 8, Issue

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org
Efficient Spatial Data mining using Integrated Genetic Algorithm and ACO
Mr.K.Sankar1 and Dr. V.Vankatachalam2
1
Assistant Professor(Senior), Department of Master of Computer Applications, KSR
College of Engineering, Tiruchengode
2
Principal, The KAVERY Engineering College, Mecheri, Salem
Abstract
Spatial data plays a key role in
numerous applications such as network
traffic, distributed security applications
such as banking, retailing, etc., The
spatial data is essential mine, useful for
decision making and the knowledge
discovery of interesting facts from large
amounts of data. Many private
institutions, organizations collect the
number of congestion on the network
while packets of data are sent, the flow
of data and the mobility of the same. In
addition other databases provide the
additional information about the client
who has sent the data, the server who
has to receive the data, total number of
clients on the network, etc. These data
contain a mine of useful information for
the network traffic risk analysis. Initially
study was conducted to identify and
predict the number of nodes in the
system; the nodes can either be a client
or a server. It used a decision tree that
studies from the traffic risk in a network.
However, this method is only based on
tabular data and does not exploit geo
routing location.
Using the data,
combined to trend data relating to the
network, the traffic flow, demand, load,
etc., this work aims at deducing relevant
risk models to help in network traffic
safety task.
The existing work provided a
pragmatic approach to multi-layer geodata mining. The process behind was to
prepare input data by joining each layer
table using a given spatial criterion, then
applying a standard method to build a
decision tree. The existing work did not
consider multi-relational data mining
domain. The quality of a decision tree
depends, on the quality of the initial data
which are incomplete, incorrect or non
relevant data inevitably leads to
erroneous results. The proposed model
develops an ant colony algorithm
integrated with GA for the discovery of
spatial trend patterns found in a network
traffic risk analysis database. The
proposed ant colony based spatial data
mining algorithm applies the emergent
intelligent behavior of ant colonies. The
experimental results on a network traffic
(trend layer) spatial database show that
our method has higher efficiency in
performance of the discovery process
compared to other existing approaches
using non-intelligent decision tree
heuristics.
Keywords: Spatial data mining, Network
Traffic, ACO, GA
1. Introduction
Data mining is the process of
extracting patterns from large data sets
by combining methods from statistics
and artificial intelligence with database
management. Given an informational
system, data mining is seen to be used as
an important tool to transform data into
business intelligence process. It is
currently used in wide range of areas
such as marketing, surveillance, fraud
97
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org
traffic risk analysis. The proposed work
detection, and scientific discovery.
aims at spatial feature of the traffic load
Automatic data processing is the result
and demand requirements and their
of the increase in size and complexity of
interaction with the geo routing
the data set. This has been used in other
environment. In previous work, the
areas of computer science as neural
system has implemented some spatial
networks, support vector machines,
data mining methods such as
genetic algorithms and decision trees. A
generalization and characterization. The
primary reason for using data mining is
proposal of this work uses intelligent ant
to assist in the analysis of collections of
agent to evaluate the search space of the
observations of network user behavior.
network traffic risk analysis along with
usage of genetic algorithm for risk
Spatial data mining try to find
pattern.
patterns in geographic data. Most
commonly used in retail, it has grown
2. Literature Review
out of the field of data mining, which
initially focused on finding patterns in
Spatial data mining fulfills real
network traffic analysis, security threats
needs of many geomantic applications. It
over a period of time, textual and
allows taking advantage of the growing
numerical electronic information. It is
availability of geographically referenced
considered to be more complicated
data and their potential richness. This
challenge than traditional mining
includes the spatial analysis of risk such
because of the difficulties associated
as epidemic risk or network traffic
with analyzing objects with concrete
accident risk in the router. This work
existences in space and time. Spatial
deals with the method of decision tree
patterns may be discovered using
for spatial data classification. This
techniques
like
classification,
method differs from conventional
association, and clustering and outlier
decision trees by taking account implicit
detection. New techniques are needed
spatial relationships in addition to other
for SDM due to spatial auto-correlation,
object attributes. Ref [2, 3] aims at
importance of non-point data types,
taking account of the spatial feature of
continuity of space, regional knowledge
the packets transmissions and their
and separation between spatial and noninteraction with the geographical
spatial subspace. The explosive growth
environment.
of spatial data and widespread use of
spatial databases emphasize the need for
How are spatial data handled in
the automated discovery of spatial
usual data mining systems? Although
knowledge. Our focus of this work is on
many data-mining applications deal at
the methods of spatial data mining, i.e.,
least implicitly with spatial data they
discovery of interesting knowledge from
essentially ignore the spatial dimension
spatial data of network traffic patterns.
of the data, treating them as non-spatial.
Spatial data are related to traffic data
This has ramifications both for the
objects that occupy space.
analysis of data and for their
visualization. First, one of the basic tasks
The institutions concern the
of exploratory data analysis is to present
routing network studies the application
the salient features of a data set in a
of data mining techniques for network
98
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org
there might be little pixel coherence.
format understandable to humans. It is
Therefore, cartogram-based distortion is
well known that visualization in
primarily a preprocessing step.
geographical space is much easier to
understand than visualization in abstract
In [8] the author proposes an
space. Secondly, results of a data mining
Improved Ant Colony Optimization
analysis may be suboptimal or even be
(IACO) and Hybrid Particle Swarm
distorted if unique features of spatial
Optimization (HPSO) method for
data, such as spatial autocorrelation
SCOC. In the process of doing so, the
([7]), are ignored. In sum, convergence
system first use IACO to obtain the
of GIS and data mining in an Internet
shortest obstructed distance, which is an
enabled spatial data mining system is a
effective method for arbitrary shape
logical progression for spatial data
obstacles, and then the system develop a
analysis technology. Related work in this
novel HPKSCOC based on HPSO and
direction has been done by Koperski and
K-Medoids to cluster spatial data with
Han, Ester et al. [4, 9].
obstacles, which can not only give
attention to higher local constringency
Rather than aggregate data,
speed and stronger global optimum
Gridfit [1] avoids overlap in the 2D
search, but also get down to the
display by repositioning pixels locally.
obstacles constraints. Spatial clustering
In areas with high overlap, however, the
is an important research topic in Spatial
repositioning depends on the ordering of
Data Mining (SDM). Many methods
the points in the database, which might
have been proposed in the literature, but
be arbitrary. Gridfit places the first data
few of them have taken into account
item found in the database at its correct
constraints that may be present in the
position,
and
moves
subsequent
data or constraints on the clustering.
overlapping data points to nearby free
These constraints have significant
positions, making their placement
influence on the results of the clustering
quasirandom. Cartograms [5] are another
process of large spatial data. In this
common technique dealing with
project, the system discuss the problem
advanced map distortion. Cartogram
of spatial clustering with obstacles
techniques let data analysts trade shape
constraints and propose a novel spatial
against area and preserve the map’s
clustering method based on Genetic
topology to improve map visualization
Algorithms (GAs) and KMedoids, called
by scaling polygonal elements according
GKSCOC, which aims to cluster spatial
to an external parameter. Thus, in
data with obstacles constraints.[9]
cartogram techniques, the rescaling of
map regions is independent of a local
distribution of the data points. A
3. Genetic and ACO Based Spatial
cartogram-based map distortion provides
Data Mining Model
much better results, but solves neither
the overlap nor the pixel coherence
Before data mining algorithms
problems. Even if the cartogram
can be used, a target data set must be
provides a perfect map distortion (in
collected. As data mining only uncover
many cases, achieving a perfect
patterns already present in the data, the
distortion is impossible), many data
target dataset must be large enough to
points might be at the same location, and
contain these patterns. A common source
99
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org
substantially for relational (attribute)
for data is a data mart or data warehouse.
data management and for topological
Pre-process is essential to analyze the
(feature) data management Geographic
multivariate datasets before clustering or
data repositories increasingly include ill
data mining. The target set is then
structured data such as imagery and geo
cleaned.
Cleaning
removes
the
referenced multimedia. The strength of
observations with noise and missing
network GIS is in providing a rich data
data. The clean data are reduced into
infrastructure for combining disparate
feature vectors, one vector per
data in meaningful ways by using spatial
observation. A feature vector is a
proximity.
summarized version of the raw data
observation. This might be turned into a
The next logical step to take
feature vector by locating the eyes and
Network
GIS
analysis
beyond
mouth in the image. The feature vectors
demographic reporting to true market
are divided into two sets, the "training
intelligence is to incorporate the ability
set" and the "test set". The training set is
to analyze and condense a large number
used to "train" the data mining
of variables into a single forecast or
algorithm(s), while the test set is used to
score. This is the strength of predictive
verify the accuracy of any patterns found
data mining technology and the reason
why there is such a true relationship
The proposed spatial data mining
between Network GIS & data mining.
model uses ACO integrated with GA for
Depending upon the specific application,
network risk pattern storage. The
Network GIS can combine historical
proposed ant colony based spatial data
customer or retail store sales data with
mining algorithm applies the emergent
syndicated
demographic,
business,
intelligent behavior of ant colonies. The
network traffic, and market research
proposed system handle the huge search
data. This dataset is then ideal for
space encountered in the discovery of
building predictive models to score new
spatial data knowledge. It applies an
locations or customers for sales
effective greedy heuristic combined with
potential,
cross-selling,
targeted
the trail intensity being laid by ants
marketing, customer churn, and other
using a spatial path. GA uses searching
similar applications. Geospatial data
population to produce a new generation
repositories tend to be very large.
population. The proposed system
Moreover, existing GIS datasets are
develops an ant colony algorithm for the
often splintered into feature and attribute
discovery of spatial trends in a GIS
components that are conventionally
network traffic risk analysis database.
archived in hybrid data management
Intelligent ant agents are used to
systems. Algorithmic requirement differ
evaluate valuable and comprehensive
substantially for relational (attribute)
spatial patterns.
data management and for topological
3.1. Geo-Spatial Data Mining
(feature) data management.
Data volume was a primary
3.2 Ant Colony Optimization
factor in the transition at many federal
agencies from delivering public domain
Ant
colony
Optimization
data
via
physical
mechanisms
algorithm (ACO), a probabilistic
Algorithmic
requirements
differ
100
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org
optimal solution. If there were no
technique is deployed for evaluating
evaporation at all, the paths chosen by
spatial data inference from network
the first ants would tend to be
traffic patterns which find load and
excessively attractive to the following
demand at various instances. In the
ones. In that case, the exploration of the
natural world, ants (initially) wander
randomly, and upon finding food return
solution space would be constrained.
to their colony while laying down
pheromone trails. If other ants find such
Thus, when one ant finds a good
a path, they are likely not to keep
(i.e., short) path from the colony to a
traveling at random, but to instead
food source, other ants are more likely to
follow the trail, returning and reinforcing
follow that path, and positive feedback
it if they eventually find food.
eventually leads all the ants following a
single path. The idea of the ant colony
Ant Colony Optimization (ACO)
algorithm is to mimic this behavior with
is a paradigm for designing meta"simulated ants" walking around the
heuristic algorithms for combinatorial
graph representing the problem to solve.
optimization problems. Meta-heuristic
algorithms are algorithms which, in
3.3 Genetic Algorithm
order to escape from local optima, drive
some basic heuristic, either a
The proposed algorithm of
constructive heuristic starting from a
spatial clustering based on GAs is
null solution and adding elements to
described in the following procedure.
build a good complete one, or a local
Divide an individual risk pattern of the
search heuristic starting from a complete
network traffic generating objects
solution and iteratively modifying some
(chromosome) into n part and each part
of its elements in order to achieve a
is corresponding to the classification of a
better one. The metaheuristic part
datum element. The optimization
permits the low level heuristic to obtain
criterion is defined by a Euclidean
solutions better than those it could have
distance among the data frequently, and
achieved alone, even if iterated. The
the initial number of packets that has to
characteristic of ACO algorithms is their
be sent is produced at random. Its
explicit use of elements of previous
genetic operators are similar to standard
solutions
GA’s. This method can find the global
optimum solution and not influenced by
Over
time,
however,
the
an outlier, but it only fits for the
pheromone trail starts to evaporate, thus
situation of small network traffic risk
reducing its attractive strength. The more
pattern data sets and classification
time it takes for an ant to travel down the
number.
path and back again, the more time the
pheromones have to evaporate. A short
4. Experimental Evaluation
path, by comparison, gets marched over
faster, and thus the pheromone density
ACO with GA integration SPDM
remains high as it is laid on the path as
model is proposed to be tested in the
fast as it can evaporate. Pheromone
framework of network traffic risk
evaporation has also the advantage of
analysis. The analysis is done on a
avoiding the convergence to a locally
spatial database provided in the
101
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org
accidents and networks. Visualization
framework of an industrial collaboration.
and interaction helps to understand the
It contains data on the number of packets
dependencies within and between data
to be sent and others on the number of
sets. Visualization supports formulating
nodes that is ready to be served in the
hypotheses and answering questions
network. The objective is to construct a
about correlations between certain
predictive model. The system model
variables and collision. Explorative
looks correspondences between the
visualization may reveal new variables
packet and the other trend layers as the
relevant to the model and relevance of
number of nodes, time taken for the
already used variables. It is highly
packet to reach at the other end etc. It
required to analyze the correlations
applies classification by decision tree
combine spatial data analysis methods
while integrating the number of packets
with
visualization.
Risk
model
to be transmitted via spatial character
development is an interactive and
and their interaction with the
explorative process.
geographical
environment.
The
experimental evaluation is made on a
geographical network traffic (trend
4.2 ACO on SPDM
layer) spatial database to depict higher
ACO has been recently used in
efficiency in performance of the
some
data
mining
tasks,
e.g.
discovery process. It proves that better
classification
rule
discovery.
quality of trend patterns discovered
Considering the challenges faced in the
compared to other existing approaches
problem of spatial trend detection, ACO
using non-intelligent decision tree
suggest efficient properties in these
heuristics. Reliable data constitute the
aspects. Ant agents search for the trend
key to success of a decision tree. An
starting from their own start point in a
efficient parallel and near global
completely distributed manner. This
optimum search for network traffic risk
guides the search process to infer to a
patterns are evaluated using genetic
better subspace potentially containing
algorithm. It combines the concept of
more and better trend patterns. Finally
survival of the fittest with a structured
some measures of attractiveness can be
interchange. GAs imitates natural
defined for selecting a feasible spatial
selection of the biological evolution.
object from the neighborhood graph.
Improvements in the identification of
ACO on SPDM Effectively guide the
high or low risk areas can assist the
trend detection process of an ant ACO
emergency preparedness planning and
has been recently used in some data
resource evaluation.
mining tasks, e.g., classification rule
discovery. Considering the challenges
4.1 Spatial Data Mining on Network
faced in the problem of spatial trend
Traffic Risk Patterns
detection, ACO suggest efficient
properties in these aspects. Ant agents
Spatial data mining on network
search for the trend starting from their
traffic risk pattern focuses on the human
own start point in a completely
vulnerability in built environments. It
distributed manner. Finally some
considers issues like differences between
measures of attractiveness can be
common and rare collision, commuting
of people, and relations between
102
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org
a domain specialist in traffic risk
defined for selecting a feasible spatial
analysis. The advantage of proposed
object from the neighborhood graph.
technique allows the end-user to
evaluate the results without any
4.3 Spatial Clustering GA
assistance by an analyst or statistician.
Gas
automatically
achieve
and
Genetic algorithms are an
accumulate the knowledge about the
efficient parallel and near global
search space of the ACO. GA adaptively
optimum search method based on nature
controls the traffic risk pattern search
genetic and selection. GA combines the
process to approach a global optimal
concept of survival of the fittest with a
solution. Perform well in highly
structured interchange. Gas imitates
constrained traffic risk pattern, where the
natural selection of the biological
number of “good” solutions is very small
evolution. It uses searching population
relative to the size of the search space.
(set) to produce a new generation
population. GAs automatically achieve
The current application results
and accumulate the knowledge about the
show a use case of spatial decision trees.
search space. GA adaptively controls the
The contribution of this approach to
search process to approach a global
spatial classification lies in its simplicity
optimal solution. GA performs well in
and its efficiency. It makes it possible to
highly constrained problems, where the
classify objects according to spatial
number of “good” solutions is very small
information (using the distance). It
relative to the size of the search space.
allows adapting any decision tree
GAs provides better solution in a shorter
algorithm or tool for a spatial modeling
time, including complex problems to
problem. Furthermore, this method
solve by traditional methods.
considers the structure of geo-data in
multiple trends (patterns) which is
5. Result and Discussions
characteristic of geographical databases.
The proposed results provide
The graph below indicates the number of
spatial decision trees for network traffic
trends found and paths examined using
risk patterns with optimized route
SPDM Decision Tree and SPDM-ACOstructure with the ant agents. The
GA models for traffic risk pattern
proposed model classifies objects
analysis.
according to spatial information (using
the ant agent and the distance
pheromone).
Spatial
classification
6. Conclusion
The Spatial data mining system
provided by the proposed scheme is
of ACO with GA have shown that
simple and efficient. It allows adapting
network traffic risk patterns are
to different decision tree algorithm for
discovered efficiently and recorded in
the spatial modeling of network traffic
the genetic property for avoiding the
risk patterns. It uses the structure of geocollision risk in highly dense spatial
data in multiple trend layers which is
regions. The proposal of our system
characteristic of geographical databases.
analyzes existing methods for spatial
Finally, the quality of this analysis is
data mining and mentioned their
improved by enriching the spatial
strengths and weaknesses. The variety of
database by multiple geographical
yet unexplored topics and problems
trends, and by a close collaboration with
103
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org
[3] Anselin, L. 1995. Local Indicators of
makes knowledge discovery in spatial
Spatial Association: LISA. Geographical
databases an attractive and challenging
Analysis 27(2):93-115.
research field. This work gives an
[4] Ester, M., Frommelt, A., Kriegel,
efficient approach to multi-layer geoH.P, Sander, J., “Spatial Data Mining:
data mining. The main idea is to prepare
Database Primitives, Algorithms and
input data by joining each layer table
Efficient DBMS Support”, in Data
using a given spatial criterion, then
applying a standard method to build' a
Minining and Knowledge Discovery, an
International Journal, 1999
decision tree. The most advantage is to
[5] D.A. Keim, S.C. North, and C.
demonstrate the feasibility and the
Panse, “Cartodraw: A Fast Algorithm for
interest of integrating neighborhood
Generating Contiguous Cartograms,”
properties when analyzing spatial
objects. Our future work will focus on
IEEE
Trans.
Visualization
and
Computer Graphics (TVCG), vol. 10,
adapting recent work in multi-relational
no. 1, 2004, pp. 95-110.
data mining domain, in particular on the
[6] Haining, R. Spatial data analysis in
extension of the spatial decision trees
the social and environmental sciences,
based on neural network. Another
Cambridge Univ. Press, 1991
extension will concern automatic
[7] Hawkins, D. 1980. Identification of
filtering of spatial relationships. The
Outliers. Chapman and Hall. [Jain &
system will study of its functional
Dubes1988] Jain, A., and Dubes, R.
behavior and its performances for
1988. Algorithms for Clustering Data.
concrete cases, which has never been
Prentice Hall.
done before. Finally, the quality of this
[8] Jhung, Y., and Swain, P. H. 1996.
analysis could be improved by enriching
Bayesian Contextual Classification
the
spatial
database
by
other
Based on Modified M-Estimates and
geographical trends, and by a close
Markov
Random
Fields.
IEEE
collaboration with a domain specialist in
Transaction on Pattern Analysis and
traffic risk analysis. Indeed, the quality
Machine Intelligence 34(1):67-75.
of a decision tree depends, on the whole,
[9] Koperski, K., Han, J. “GeoMiner: A
of the quality of the initial data.
System Prototype for Spatial Mining”,
Proceedings ACMSIGMOD, Arizona,
REFERENCES
1997.
[1] . D.A. Keim and A. Herrmann, “The
K.Sankar is a Research Scholar at the
Gridfit Algorithm: An Efficient and
Anna University Coimbatore. He is now
Effective Approach to Visualizing Large
working as a Assistant Professor(Sr) at
Amounts of Spatial Data,” Proc. IEEE
KSR
College
of
Engineering,
Visualization Conf., IEEE CS Press,
Tiruchengode. His Research interests
1998, pp. 181-188.
are in the field of Data Mining and
[2]
Anselin,
L.
1988.
Spatial
Optimization Techniques.
Econometrics: Methods and Models.
Dordrecht,
Netherlands:
Kluwer.
Anselin, L. 1994. Exploratory Spatial
Dr.V.Vankatachalam is a principal of
Data
Analysis
and
Geographic
The Kavery Engineering College. He
Information Systems. In Painho, M., ed.,
received his B.E in Electronics and
New Tools for Spatial Analysis, 45-54.
104
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org
Communication at Coimbatore Institute
of Technology Coimbatore. He obtained
his M.S degree in Software systems
from Birla Institute of Technology
Pilani. He did his M.Tech in Computer
Science at Regional Engineering College
(REC) Trichy. He obtained his Ph.D
degree in Computer Science and
Engineering from Anna University
Chennai. He has published 3 papers in
International Journal and 20 papers in
International & National conferences.
105