Provided by the author(s) and University College Dublin Library in accordance with publisher policies. Please cite the published version when available.

Title: Distributed Clustering Algorithm for Spatial Data Mining
Author(s): Bendechache, Malika; Kechadi, Tahar; Chen, Chong Cheng
Publication date: 2015
Conference details: International Conference on Integrated Geo-spatial Information Technology and its Application to Resource and Environmental Management towards GEOSS (IGIT 2015), Alba Regia Technical Faculty of Óbuda University, Hungary, 16-17 January 2015
Link to online version: http://www.geo.info.hu/igit/
Item record/more information: http://hdl.handle.net/10197/6526

Some rights reserved. For more information, please see the item record link above.
Distributed Clustering Algorithm for Spatial Data Mining

Malika Bendechache¹, M-Tahar Kechadi¹, Chong Cheng Chen²

¹ School of Computer Science & Informatics, University College Dublin, Belfield, Dublin 04, Ireland
[email protected], [email protected]

² Spatial Information Research Center of Fujian, Fuzhou University, No. 523, Gongye Rd, Gulou District, Fuzhou 35002, China
[email protected]
Abstract— Distributed data mining techniques, and distributed clustering in particular, have been widely used in the last decade because they can deal with very large and heterogeneous datasets that cannot be gathered centrally. Current distributed clustering approaches normally generate global models by aggregating local results obtained at each site. While this approach analyses the datasets where they are located, the aggregation phase is complex and time consuming, and it may produce incorrect and ambiguous global clusters, and therefore incorrect knowledge. In this paper we propose a new clustering approach for very large spatial datasets that are heterogeneous and distributed. The approach is based on the K-means algorithm, but it generates the number of global clusters dynamically; it is not necessary to fix the number of clusters in advance. Moreover, this approach uses a sophisticated aggregation phase, designed in such a way that the final clusters are compact and accurate while the overall process remains efficient in both time and memory. Preliminary results show that the proposed approach scales up well in terms of running time and result quality. We also compared it to two other clustering algorithms, BIRCH and CURE, and show clearly that our approach is much more efficient than both.

Keywords— Spatial data, clustering, distributed mining, data analysis, k-means.
I. INTRODUCTION

Across a wide variety of fields, datasets are being collected and accumulated at a dramatic pace, and the massive amounts of data being gathered are stored on different sites. In this context, data mining (DM) techniques have become necessary for extracting useful knowledge from these rapidly growing, large, multi-dimensional datasets [U.Fayyad-96]. In order to cope with the scale of the data, researchers have developed parallel versions of sequential DM algorithms [A.Freitas-07]. Because a single machine lacks the resources to process such large datasets, parallel data mining (PDM) can be deployed on distributed platforms such as clusters of computers or Grids [I.Foster-99]. In PDM algorithms, the calculations are performed on each node to create local results, and these local results are then aggregated to build global ones. However, in the case of very large datasets, the aggregation phase creates significant communication overheads. In addition, this is only feasible if the datasets have already been distributed over the nodes, which is currently the case, as massive amounts of data are stored on the different sites where they were produced. In this context, distributed data mining (DDM) techniques have become necessary for analysing these large and multi-dimensional datasets. DDM is more appropriate for large-scale distributed platforms, where datasets are often geographically distributed and owned by different organisations. Many DDM tasks, such as distributed association rules and distributed classification [T.G.Dietterich-00, H.Kargupta-00, R.Agrawal-96, L.Aouad-09, L.Aouad6-10, N.Le-Khac-10], have been proposed and developed over the last few years. However, only little research concerns distributed clustering for analysing large, heterogeneous and distributed datasets. Recent work [E.Januzaj-04, N.A.Le-Khac-07, L.Aouad1-07] has proposed a distributed clustering model based on two main steps: perform partial analysis on the local data at each individual site, and then send the results to a central site that generates the global models by aggregating the local results. Even so, existing distributed clustering techniques do not scale well, and one of the biggest issues of these systems is the generation of good global models, because the local models do not contain enough information for the merging process.

In this paper, we propose a new approach to distributed clustering. In our approach, centralised clustering is also performed at each site to build local models. These models are sent to servers where the clusters are regenerated from the features of the local models, and further analysis is carried out to create the global models [J.F.Laloux-11].

The rest of this paper is organised as follows: in the next section we give an overview of distributed data mining and discuss the limitations of traditional techniques. We then present and discuss our approach in Section III. Section IV presents the implementation of the approach, and we discuss the experimental results in Section V. Finally, we conclude in Section VI.

II. RELATED WORK

While massive amounts of data are being collected and stored, not only in science but also in industry and commerce, the efficient mining and management of the useful information in these data is becoming a scientific challenge and a massive economic need. This is even more critical when the collected data are located on different sites and owned by
different organisations [E.Jiawei-06]. This led to the development of DDM techniques that deal with huge, multi-dimensional, and heterogeneous datasets distributed over a large number of nodes. Existing DDM techniques consist of two main phases: 1) performing partial analysis on local data at individual sites, and 2) generating global models by aggregating the local results. These two steps are not independent, since naive approaches to local analysis may produce incorrect and ambiguous global data models. In order to take advantage of the knowledge mined at different locations, DDM should have a view of that knowledge which not only facilitates its integration but also minimises the effect of the local results on the global models. Briefly, efficient management of distributed knowledge is one of the key factors affecting the outputs of these techniques.
Moreover, data collected at different locations using different instruments may have different formats, features, and quality. Traditional centralised data mining techniques do not address all the issues of modern data-driven applications, such as scalability in both response time and accuracy of solutions, distribution, and heterogeneity [L.Aouad-09, M.Bertolotto-07].
Some distributed DM approaches are based on ensemble learning, which uses various techniques to aggregate the results [N.A.Le-Khac-07]; among the most cited in the literature are majority voting, weighted voting, and stacking [P.Chan-95, Reeves-93]. More recently, very interesting distributed approaches were introduced in which the first phase generates local knowledge from the local datasets and the second phase integrates the local knowledge to generate global results or decisions. Some approaches are well suited to distributed platforms. For instance, incremental algorithms that discover spatio-temporal patterns by decomposing the search space into a hierarchical structure, and that address multi-granular spatial data, can easily be optimised on a hierarchical distributed system topology. From the literature, two categories of techniques are used: parallel techniques, which often require dedicated machines and tools for communication between parallel processes and are therefore very expensive; and techniques based on aggregation, which proceed in a purely distributed manner, either on the data-based models or on the execution platforms [L.Aouad3-07, L.Aouad4-09]. However, as the amount of data has continued to increase in recent years, the majority of existing data mining techniques do not perform well, since they suffer from poor scalability. This has become a very critical issue. Many solutions have been proposed so far; they are generally based on improving the data mining algorithms so that they can handle the amounts of data at hand.
Clustering is one of the fundamental techniques in data mining. It groups data objects based on information found in the data that describes the objects and their relationships. The goal is to optimise the similarity within each cluster and the dissimilarity between clusters in order to identify interesting structures in the data [L.Aouad3-07]. Clustering algorithms can be divided into two main categories, namely partitioning and hierarchical; various elaborate taxonomies of existing clustering algorithms are given in the literature. Many parallel versions of these algorithms have been proposed [L.Aouad3-07, I.Dhillon-99, M.Ester-96, Garg-06, H.Geng-05, Inderjit-00, X.Xu-99]. These algorithms can be further classified into two sub-categories. The first consists of methods requiring multiple rounds of message passing, which demand a significant amount of synchronisation. The second sub-category consists of methods that build local clustering models and send them to a central site to build global models [J.F.Laloux-11]. In [I.Dhillon-99] and [Inderjit-00], message-passing versions of the widely used k-means algorithm were proposed. In [M.Ester-96] and [X.Xu-99], the authors dealt with the parallelisation of DBSCAN, a density-based clustering algorithm. In [Garg-06], a parallel message-passing version of the BIRCH algorithm was presented. A parallel version of a hierarchical clustering algorithm, called MPC (Message Passing Clustering), dedicated especially to microarray data, was introduced in [H.Geng-05]. Most of the parallel approaches need either multiple synchronisation constraints between processes or a global view of the dataset, or both [L.Aouad3-07].
Another approach, presented in [L.Aouad1-07], also merges local models to create global models. Current approaches focus on either merging local models or mining a set of local models to build global ones; if the local models cannot effectively represent the local datasets, the accuracy of the global models will be very poor [J.F.Laloux-11].
Both the partitioning and the hierarchical categories suffer from drawbacks. In the partitioning class, the k-means algorithm needs the number of clusters to be fixed in advance, while in the majority of cases K is not known beforehand. Hierarchical clustering algorithms overcome this limit: they do not need the number of clusters as an input parameter, but they must define stopping conditions for the clustering decomposition, which is not easy.
III. DISTRIBUTED CLUSTERING
The proposed distributed approach is designed for clustering spatial datasets. It is based on two main steps: 1) generate local clusters on each dataset assigned to a given processing node, and 2) aggregate the local results into global clusters. The local clustering algorithm is chosen to be K-Means, executed with a given Ki that can be different for each node (see Fig 1). Ki should be chosen big enough to identify all the clusters of the local dataset.

The boundaries of the local clusters constitute the new dataset, and they are much smaller than the original datasets. These boundaries therefore become the local results at each node in the system. The local results are sent to the leaders through the tree topology of nodes until the root node is reached, where the global results are generated and stored.
Fig 1. Overview of the proposed approach.
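To make the local step concrete, the listing below sketches what a single node's clustering phase could look like. It is a minimal illustration, assuming scikit-learn's KMeans as the local clusterer and a deliberately generous Ki; the function name and parameters are ours, not part of the paper.

# Minimal sketch of the local clustering phase on one node (Python).
# Assumes scikit-learn; local_fragment is this node's share of the data.
import numpy as np
from sklearn.cluster import KMeans

def local_clustering(local_fragment: np.ndarray, k_i: int):
    """Run K-Means with a deliberately generous k_i and return
    the points of each non-empty sub-cluster."""
    km = KMeans(n_clusters=k_i, n_init=10, random_state=0).fit(local_fragment)
    clusters = []
    for label in range(k_i):
        members = local_fragment[km.labels_ == label]
        if len(members) > 0:          # K-Means can leave clusters empty
            clusters.append(members)
    return clusters

# Example: a node holding 2-D spatial points, over-clustered with Ki = 30.
rng = np.random.default_rng(0)
fragment = rng.normal(size=(2000, 2))
sub_clusters = local_clustering(fragment, k_i=30)
print(len(sub_clusters), "local sub-clusters")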
After generating the local results, the nodes compare their local clusters with their neighbour nodes' clusters; then nodes called leaders are elected to merge local clusters into larger clusters using the overlay technique. These leaders are elected according to conditions such as the capacity of the node. The process of merging clusters continues until we reach the root node, which will contain the global clusters (models).

During this second phase, communicating the local clusters to the leaders may generate a huge overhead. The objective is therefore to minimise the data communication and computational time while still obtaining accurate global results. Note that while our approach is based on the same principle as existing approaches, it minimises the communication overheads due to data exchange: instead of exchanging the whole data (whole clusters) between nodes (local nodes and leaders), we first reduce the data that represents a cluster. The size of this reduced representation is much smaller than that of the initial one. This reduction is carried out on each local node.
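To give a feel for the size of this reduction, the snippet below compares a cluster's full point count with the number of vertices on its boundary. The convex hull is our stand-in for a contour here, and the ratio will of course vary with the shape of the data.

# Quick illustration (Python): a contour is far smaller than the
# cluster it summarises. Convex hull used as a stand-in contour.
import numpy as np
from scipy.spatial import ConvexHull

pts = np.random.default_rng(2).normal(size=(10000, 2))
hull = ConvexHull(pts)
print(len(pts), "points ->", len(hull.vertices), "contour vertices")
# Typically ~20 vertices for 10000 points: the exchanged data shrinks hugely.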
Many data reduction techniques have been proposed in the literature. Many of them focus only on the dataset size, i.e., they try to reduce the storage of the data without paying attention to the knowledge behind the data. In [N-A-10], an efficient reduction technique based on density-based clustering algorithms was proposed. In such approaches, we choose representatives of each cluster. However, selecting representatives is still a challenge in terms of the quality and the size of the representative set; we can choose, for instance, medoids, core points, or even specific core points [E.Januzaj-04] as representatives [J.F.Laloux-11].
In our approach, we focus on the shape and the density of the clusters created. The shape of a cluster is represented by its boundary points, called its contour (see Fig 2). Many algorithms for extracting the boundary of a cluster are cited in the literature [M.J.Fadilia-04, A.Ray-97, M.Melkemi-00, Edelsbrunner-83, A.Moreira-07]. We used the algorithm proposed in [M. Duckhama-08], which is based on triangulation and efficiently constructs a possibly non-convex boundary. The algorithm is able to accurately characterise the shape of a wide range of point distributions and densities, with a reasonable complexity of O(n log n).
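As a rough stand-in for the cited method, the sketch below derives a possibly non-convex boundary from a Delaunay triangulation by discarding triangles with a large circumradius (an alpha-shape-style filter). This only approximates, and is not, the algorithm of [M. Duckhama-08]; the threshold alpha is a hypothetical tuning parameter of our own.

# Sketch (Python): extracting a possibly non-convex cluster boundary
# with an alpha-shape-style filter on a Delaunay triangulation.
import numpy as np
from scipy.spatial import Delaunay

def cluster_contour(points: np.ndarray, alpha: float):
    """Return the boundary edges (pairs of point indices) of a cluster."""
    tri = Delaunay(points)
    edge_count = {}
    for simplex in tri.simplices:
        a, b, c = points[simplex]
        # Circumradius r = |ab| * |bc| * |ca| / (4 * area).
        la = np.linalg.norm(b - c)
        lb = np.linalg.norm(a - c)
        lc = np.linalg.norm(a - b)
        area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                         - (b[1] - a[1]) * (c[0] - a[0]))
        if area == 0 or (la * lb * lc) / (4.0 * area) > alpha:
            continue  # drop degenerate or oversized triangles
        for i, j in ((0, 1), (1, 2), (0, 2)):
            edge = tuple(sorted((simplex[i], simplex[j])))
            edge_count[edge] = edge_count.get(edge, 0) + 1
    # An edge lies on the boundary iff it belongs to exactly one kept triangle.
    return [e for e, n in edge_count.items() if n == 1]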
IV. IMPLEMENTATION

A. Summarised Algorithm
This section summarises the proposed approach. In the first step, local clustering is performed on each local dataset using the K-means algorithm, in parallel; the local number of clusters can be different at each site, so the local clustering at node i produces Ki sub-clusters. In the next step, each node applies an algorithm to calculate the contour of each cluster generated locally. These contours then represent the new local data of each node. At the end of the local processing, each node compares its clusters to its neighbours' clusters to find out whether their contours overlap. Each node therefore belongs to a group representing its neighbourhood, and each node communicates only with its neighbours, which optimises the communication in the system.
In the third step, the local contours are sent to the chosen merging leaders. These leaders are elected among the nodes of each group, and each leader merges the overlapping contours of its group, thereby generating new contours (a new local dataset). We repeat the second and third steps until we reach the root node. The sub-cluster aggregation thus follows a tree structure, and the global results are located at the last level of the tree (the root node).
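As an illustration of this overlap-and-merge step, the sketch below treats each contour as a polygon and fuses any that intersect. It assumes the shapely geometry library (Polygon and unary_union are shapely calls); the surrounding workflow is our own simplification of a leader's role, not the paper's implementation.

# Sketch (Python) of a leader node's merge step: any overlapping
# cluster contours are fused into a single larger contour.
from shapely.geometry import Polygon
from shapely.ops import unary_union

def merge_overlapping_contours(contours):
    """contours: list of point sequences, one per local cluster.
    Returns the merged contours (one polygon per resulting cluster)."""
    polygons = [Polygon(c) for c in contours]
    merged = unary_union(polygons)      # fuses intersecting polygons
    # unary_union returns a single Polygon or a MultiPolygon.
    return list(merged.geoms) if hasattr(merged, "geoms") else [merged]

# Two overlapping squares and one separate square -> two final contours.
a = [(0, 0), (2, 0), (2, 2), (0, 2)]
b = [(1, 1), (3, 1), (3, 3), (1, 3)]
c = [(10, 10), (11, 10), (11, 11), (10, 11)]
print(len(merge_overlapping_contours([a, b, c])))  # prints 2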
As with all clustering algorithms, the potentially large variability in cluster shapes and densities is an issue. However, as we show in the experiments section, the algorithm used for generating the clusters' contours is effective at detecting well-separated clusters with different shapes. In addition, our approach determines the number of clusters dynamically, without a priori knowledge about the data or an estimation process for the number of clusters.
 Pseudo-code algorithm: Distributed Clustering Algorithm based on K-means

The system has a tree topology, and the nodes execute the local calculations in parallel.
1) Data distribution: initially, the data is divided among all the nodes in the system.
2) The basic nodes (the tree leaves) execute the K-means algorithm in parallel, each with its own Ki, which can be larger than the number of clusters actually present in its local dataset.
3) Neighbouring nodes share their clusters to form even larger clusters using the overlay technique.
4) The results reside in the father node (called the ancestor).
5) Repeat 3 and 4 until reaching the root (heap) node.

Algorithm 1 below gives the pseudo-code that is executed in parallel at each node of the system.
Algorithm 1: Distributed clustering algorithm based on K-Means

Input:  Xi: dataset fragment; Ki: number of sub-clusters for Nodei; D: tree degree.
Output: Kg: global clusters (the global results)

level = tree height;
1) K-Means(Xi, Ki);
   // Nodei executes the K-Means algorithm locally.
2) Nodei joins a group G of D elements;
   // Nodei joins its neighbourhood.
3) Compare the clusters of Nodei to the other nodes' clusters in the same group;
   // look for overlaps between clusters.
4) j = electLeaderNode();
   // elect the node that will merge the overlapping clusters.
if (i ≠ j) then
   send(contour_i, j);
else
   if (level > 0) then
      level--;
      repeat 2, 3 and 4 until level = 1;
   else
      return (Kg: Nodei's clusters);
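A single-process rendering of Algorithm 1's control flow is sketched below. The grouping, leader election, and level handling are deliberately simplified (the leader is just the lowest node id in each group), so this shows the aggregation pattern rather than a real distributed implementation; the merge function can be, for instance, merge_overlapping_contours from the earlier sketch.

# Sketch (Python) of Algorithm 1's aggregation rounds in one process.
# Each node holds a list of contours; groups of D nodes elect a
# leader (lowest id here) that keeps the merged contours.

def aggregate(node_contours: dict, d: int, merge):
    """node_contours: {node_id: [contour, ...]}; merge: a function that
    fuses a list of contours into the merged ones."""
    level_nodes = sorted(node_contours)
    while len(level_nodes) > 1:                  # until the root is reached
        next_level = []
        for i in range(0, len(level_nodes), d):  # groups of D neighbours
            group = level_nodes[i:i + d]
            leader = group[0]                    # trivial leader election
            pooled = [c for n in group for c in node_contours[n]]
            node_contours[leader] = merge(pooled)
            next_level.append(leader)
        level_nodes = next_level
    return node_contours[level_nodes[0]]         # global clusters at the root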
B. Example of execution
Suppose the system contains N = 5 nodes and each node executes the K-Means algorithm with a different Ki, as shown in Fig 2: Node1 executes K-Means with K=30, Node2 with K=60, Node3 with K=90, Node4 with K=120, and Node5 with K=150. Each node in the system generates its local clusters; afterwards, local clusters that overlap with each other are merged to generate new clusters, and so on, until we obtain the final clusters. Although we started with different values of K, we end up with only five final clusters (see Fig 2).
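To mirror this example on synthetic data, the toy simulation below runs five nodes with Ki = 30, 60, 90, 120, 150 on five well-separated blobs and merges the overlapping boundaries. It substitutes convex hulls for the paper's non-convex contours purely to keep the sketch short, and the blob layout, node count, and Ki values are illustrative assumptions, not the paper's dataset.

# Toy simulation (Python): 5 nodes, a different Ki each, merged into
# global clusters. Convex hulls stand in for the real contours.
import numpy as np
from sklearn.cluster import KMeans
from shapely.geometry import MultiPoint
from shapely.ops import unary_union

rng = np.random.default_rng(1)
# Five well-separated blobs, split across 5 nodes.
centers = np.array([(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)])
data = np.vstack([rng.normal(c, 0.7, size=(600, 2)) for c in centers])
rng.shuffle(data)
fragments = np.array_split(data, 5)

hulls = []
for frag, k_i in zip(fragments, (30, 60, 90, 120, 150)):
    km = KMeans(n_clusters=k_i, n_init=4, random_state=0).fit(frag)
    for label in range(k_i):
        pts = frag[km.labels_ == label]
        if len(pts) >= 3:                        # need 3+ points for a hull
            h = MultiPoint(pts).convex_hull
            if h.geom_type == "Polygon":         # skip degenerate hulls
                hulls.append(h)

merged = unary_union(hulls)
n_global = len(merged.geoms) if hasattr(merged, "geoms") else 1
print("global clusters:", n_global)   # typically 5 for well-separated blobs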
V. EXPERIMENTAL RESULTS

In this section, we study the performance of our algorithm and demonstrate its effectiveness compared to the BIRCH and CURE algorithms.
BIRCH: We used the implementation of BIRCH provided by the authors in [Tian-96]. It performs a pre-clustering and then uses a centroid-based hierarchical clustering algorithm whose time and space complexity are quadratic in the number of points after pre-clustering. We set the parameters to the default values suggested in [Tian-96].
CURE: We used the implementation of CURE provided by the authors in [Sudipto-01]. The algorithm uses representative points with shrinking towards the mean. As described in [Sudipto-01], when two clusters are merged at each step of the algorithm, the representative points for the new merged cluster are selected from the ones for the two original clusters rather than from all the points in the merged cluster.

Our Algorithm: Our algorithm is described in Section IV. It executes the K-Means algorithm with a different K for each node. The key point is to choose K bigger than the number of actual clusters: the bigger K is, the more accurate the generated clusters are. As described at the end of Section IV, when two clusters are merged at each step of the algorithm, the representative points for the new merged cluster are the union of the contours of the two original clusters rather than all the points in the merged cluster. This speeds up the execution time without adversely impacting the quality of the clusters generated. In addition, our implementation uses a tree topology and heap data structures, which also improves the complexity of the algorithm.
Fig 2. Distributed Clustering Algorithm based on K-Means
TABLE 1: DATASET
Number of points:    14000
Shape of clusters:   Big oval (egg shape)
Number of clusters:  5
A. Data sets
We experimented with different datasets containing points in two dimensions. As this is a short paper, we report the results for only one dataset, described in TABLE 1, which also gives the number of points. The dataset contains five big elliptic clusters that traditional partitioning and hierarchical clustering algorithms, including BIRCH and CURE, fail to find. We show that our algorithm not only clusters this dataset correctly, but also runs much faster than BIRCH and CURE.
B. Quality of Clustering
We ran the three algorithms on the same dataset to compare the quality of the clusters they generate. Fig 3 shows the clusters found by the three algorithms; the clusters generated by BIRCH and CURE are identified by colours, with each colour representing one cluster.

As expected, since BIRCH uses a centroid-based hierarchical clustering algorithm on the pre-clustered points, it could not find all the clusters correctly: it splits the larger cluster while merging the others. The CURE algorithm succeeds in generating the majority of the clusters, but it still fails to discover all of them. Our distributed clustering algorithm successfully generates all the clusters with the default parameter settings of Section IV; as shown in Fig 3, after merging the local clusters we obtain the five final clusters.
C. Comparison of DISCBK (DIStributed Clustering Algorithm Based on K-means) Execution Time to BIRCH and CURE
The goal of the experiment in this subsection is to demonstrate the impact of combining a parallel and distributed architecture, which copes with the limited capacity of any single node in the system, with a tree topology that accelerates the computation.
Fig 4. Comparison to BIRCH and CURE.
Fig 4 illustrates the performance of our algorithm compared to BIRCH and CURE as the number of data points increases from 100,000 to 500,000, while the number of clusters and their shapes remain unchanged. For our algorithm, we consider two values for the number of nodes in the system, N=100 and N=200, in order to show the effectiveness of the parallelisation phase. The execution times do not include the time to display the clusters, since this is approximately the same for the three algorithms.
As the graphs demonstrate, our algorithm's execution times are always lower than CURE's and BIRCH's. In addition, increasing the number of nodes in the system further improves our running times. Finally, as the number of points increases, our execution times grow slowly; in contrast, the execution time of BIRCH's pre-clustering algorithm increases much more rapidly with the dataset size, because BIRCH scans the entire database and uses all the points in the dataset for pre-clustering. The above results thus confirm that our distributed clustering algorithm is very efficient compared to both BIRCH and CURE.
D. Scalability
The goal of the scalability experiments is to determine the effect of increasing the number of nodes in the system on the execution times. The dataset contains 1,000,000 points. The results for a varying number of nodes are shown in Fig 5: our algorithm took only a few seconds to cluster 1,000,000 points in a distributed system containing over 500 nodes. The algorithm can therefore comfortably handle large datasets, because the computations are performed in parallel.
Fig 3. Clusters generated from the dataset.

The execution times do not include the time for the display phase. In Fig 5, we plot the execution time of the algorithm as the number of nodes increases from 10 to 600. The graphs confirm that the computational complexity of our algorithm is logarithmic with respect to the dataset size.

Fig 5. Scalability experiments.
VI. CONCLUSIONS
In this paper, we presented a new approach for distributed clustering of spatial datasets. In this approach, local models are generated by executing the K-Means algorithm at each node, and the local results are then merged to build the global clusters. The local models are extracted from the local datasets so that their sizes are small enough to be exchanged through the network.

We then recover the local datasets from their local models and merge them together to build the new global models. Preliminary results of this algorithm are presented and discussed. They show the effectiveness of our approach, both in the quality of the clusters generated and in the execution time, compared to the BIRCH and CURE algorithms. Furthermore, they demonstrate that the algorithm not only outperforms these existing algorithms but also scales well to large databases without sacrificing clustering quality. Our method differs from the distributed clustering models presented in the literature in that it regenerates the global results by mining the local datasets using the K-means algorithm with a different K at each node, whereas most current methods are based on exchanging whole clusters to build the global ones. A more extensive evaluation is ongoing; in the future, we intend to run our algorithm on various large real-world datasets in order to prove its robustness.
ACKNOWLEDGMENT

The research and the publication were supported by the European Union project IGIT, entitled "Integrated geo-spatial information technology and its application to resource and environmental management towards the GEOSS" (grant agreement PIRSES-GA-2009-247608). We would like to thank our researchers for making their scientific results available for this analysis.
REFERENCES

U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "Knowledge discovery and data mining: Towards a unifying framework," AAAI Press, 1996, pp. 82–88.

A. Freitas and S. H. Lavington, Mining Very Large Databases with Parallel Processing. Springer, 2000.

I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, CA, USA, 1999.

T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization," Machine Learning, vol. 40, pp. 139–157, 2000.

H. Kargupta and P. Chan, Advances in Distributed and Parallel Knowledge Discovery. MIT Press, Cambridge, MA, USA, 2000.

R. Agrawal and J. C. Shafer, "Parallel mining of association rules," IEEE Transactions on Knowledge and Data Engineering, vol. 8, pp. 962–969, 1996.

L. Aouad, N.-A. Le-Khac, and M.-T. Kechadi, "Performance study of a distributed apriori-like frequent itemsets mining technique," Knowledge and Information Systems, vol. 23, pp. 55–72, April 2010.

L. Aouad, N.-A. Le-Khac, and M.-T. Kechadi, "Grid-based approaches for distributed data mining applications," Journal of Algorithms & Computational Technology, vol. 3, pp. 517–534, December 2009.

N.-A. Le-Khac, L. Aouad, and M.-T. Kechadi, Toward a Distributed Knowledge Discovery on Grid Systems. Springer, London, April 2010.

E. Januzaj, H.-P. Kriegel, and M. Pfeifle, "DBDC: Density-based distributed clustering," in Proc. 9th International Conference on Extending Database Technology (EDBT 2004), LNCS vol. 2992, 2004, pp. 88–105.

N.-A. Le-Khac, L. Aouad, and M.-T. Kechadi, "A new approach for distributed density based clustering on grid platform," Springer, Berlin/Heidelberg, 2007, pp. 247–258.

L. Aouad, N.-A. Le-Khac, and M.-T. Kechadi, Advances in Data Mining: Theoretical Aspects and Applications. Springer, Berlin/Heidelberg, Germany, July 2007.

J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, April 2006.

J.-F. Laloux, N.-A. Le-Khac, and M.-T. Kechadi, "Efficient distributed approach for density-based clustering," in Proc. 20th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), June 2011, pp. 145–150.

M. Bertolotto, S. Di Martino, F. Ferrucci, and M.-T. Kechadi, "Towards a framework for mining and analysing spatio-temporal datasets," International Journal of Geographical Information Science (Geovisual Analytics for Spatial Decision Support), vol. 21, pp. 895–906, 2007.

P. Chan and S. J. Stolfo, "A comparative evaluation of voting and meta-learning on partitioned data," in Proc. 12th International Conference on Machine Learning, Morgan Kaufmann, 1995, pp. 90–98.

C. R. Reeves, Modern Heuristic Techniques for Combinatorial Problems, 1st ed. John Wiley & Sons, New York, NY, USA, 1993.

T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in Proc. 1996 ACM SIGMOD International Conference on Management of Data, vol. 25, ACM, New York, NY, USA, 1996, pp. 103–114.

L. Aouad, N.-A. Le-Khac, and M.-T. Kechadi, "Lightweight clustering technique for distributed data mining applications," in Advances in Data Mining: Theoretical Aspects and Applications, Springer, Berlin/Heidelberg, 2007, pp. 120–134.

S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," vol. 26, Elsevier Science, Oxford, UK, 2001, pp. 35–58.

I. Dhillon and D. Modha, "A data-clustering algorithm on distributed memory multiprocessors," in Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems (SIGKDD), Springer-Verlag, London, UK, 1999, pp. 245–260.

M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," AAAI Press, 1996, pp. 226–231.

Garg, Mangla, Bhatnagar, and Gupta, "PBIRCH: A scalable parallel clustering algorithm for incremental data," in Proc. 10th International Database Engineering and Applications Symposium (IDEAS '06), Delhi, 2006, pp. 315–316.

H. Geng and X. Deng, "A new clustering algorithm using message passing and its applications in analyzing microarray data," in Proc. 4th International Conference on Machine Learning and Applications (ICMLA '05), IEEE, December 2005, pp. 145–150.

I. S. Dhillon and D. S. Modha, "A data-clustering algorithm on distributed memory multiprocessors," in Large-Scale Parallel Data Mining, Springer, Berlin/Heidelberg, 2000, pp. 245–260.

X. Xu, J. Jäger, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, pp. 263–290, September 1999.

N.-A. Le-Khac, M. Bue, and M. Whelan, "A knowledge-based data reduction for very large spatio-temporal datasets," in Proc. International Conference on Advanced Data Mining and Applications (ADMA 2010), November 2010.

M. J. Fadili, M. Melkemi, and A. ElMoataz, "Non-convex onion-peeling using a shape hull algorithm," Pattern Recognition Letters, vol. 24, Elsevier, October 2004.

A. Ray Chaudhuri, B. B. Chaudhuri, and S. K. Parui, "A novel approach to computation of the shape of a dot pattern and extraction of its perceptual border," Computer Vision and Image Understanding, vol. 68, December 1997.

M. Melkemi and M. Djebali, "Computing the shape of a planar points set," vol. 33, Elsevier Science, September 2000.

H. Edelsbrunner, D. G. Kirkpatrick, and R. Seidel, "On the shape of a set of points in the plane," IEEE Transactions on Information Theory, vol. 29, July 1983.

A. Moreira and M. Y. Santos, "Concave hull: A k-nearest neighbours approach for the computation of the region occupied by a set of points," in Proc. GRAPP, International Conference on Computer Graphics Theory and Applications, 2007.

M. Duckham, L. Kulik, and M. Worboys, "Efficient generation of simple polygons for characterizing the shape of a set of points in the plane," vol. 41, Elsevier Science, New York, NY, USA, March 2008.