PROCEEDINGS OF THE 5th WORKSHOP ON POSITIONING, NAVIGATION AND COMMUNICATION 2008 (WPNC’08)
Discovering Regular Groups of Mobile Objects
Using Incremental Clustering
Sigal Elnekave, Mark Last, Oded Maimon, Yehuda Ben-Shimol, Hans Einsiedler, Menahem
Friedman, Matthias Siebert
Abstract—As technology advances, detailed data on the positions of moving objects, such as humans and vehicles, becomes available. In order to discover groups of mobile objects that usually move in similar ways, we propose an incremental clustering algorithm that clusters mobile objects according to the similarity of their movement patterns. The proposed clustering algorithm uses a new, "data-amount-based" similarity measure between mobile trajectories. The clustering algorithm is evaluated on two spatio-temporal datasets using clustering validity measures.
Index Terms—Clustering, Mobile objects, Spatio-temporal
data mining.
I. INTRODUCTION

With technological progress, detailed data is available on
the location of moving objects at different times (e.g.,
humans and vehicles), either via GPS technologies, mobile
computer logs, or wireless communication devices. This
creates an appropriate basis for developing efficient new
methods for mining the movement patterns of moving objects.
Spatio-temporal data can be used for many different
purposes. The discovery of patterns in spatio-temporal data,
for example, can greatly enhance the knowledge in fields such
as animal migration analysis, weather forecasting, and mobile
marketing. Clustering spatio-temporal data can also help in
social groups' discovery, which is used in tasks like shared
data allocation, targeted advertising, and personalization of
content and services.
Previous work on mining spatio-temporal data includes
querying data using special indexes for efficient performance,
recognizing trajectory patterns, and clustering trajectories of
closely moving objects as 'moving micro-clusters'. Extensive
research has been done on spatial data and temporal data
separately. Periodicity has been studied only with time series
databases. The field of spatio-temporal mining is relatively
young, and requires much more research.
The goal of this work is to discover regular groups of moving objects. In order to achieve this goal, we first use a compact representation of a spatio-temporal trajectory and define an algorithm for building it. Then we define a new, data-amount-based similarity measure between trajectories, based on the proximity of the trajectories in time and space. This measure allows the discovery of groups that have similar spatio-temporal behavior on a regular basis. Next, we cluster objects into groups according to the similarity between their periodic movement patterns. Finally, we evaluate the proposed algorithms by conducting experiments on both synthetic and real data streams.

Manuscript received October 28, 2007. This work was supported in part by Deutsche Telekom AG Labs.
II. RELATED WORK

Clustering is a mature data mining field, which we use in this work with some variations in order to fit the spatio-temporal environment. Jain et al. [1] present a literature review on the subject of clustering. According to their review, a clustering task involves the phases of pattern representation, definition of a pattern proximity measure, clustering or grouping, data abstraction, and assessment of the output. Grouping can be done in a hard way, where each object is assigned to only one cluster, or in a fuzzy way, where each object can have different membership grades in each cluster. Clustering can be implemented by hierarchical algorithms, which produce a nested series of merging or splitting clusters based on a similarity criterion, or by partitioning algorithms, which identify the partition that optimizes a clustering performance criterion.

Clustering validity assessments are usually objective and are performed to determine whether the output is meaningful and could not have occurred by chance. Statistical validation methods include external assessment of validity, which compares the structure to an a priori structure; internal examination of validity, which checks whether the structure is intrinsically appropriate for the data; and relative tests, which compare two structures and measure their relative merit.

There exist several recent publications on clustering moving objects and trajectories. The problem of clustering moving objects is studied by Li et al. [2], who use moving micro-clusters (MMCs) for handling very large datasets of mobile objects. A micro-cluster denotes a group of objects that are not only close to each other at the current time, but are also likely to move together for a while. In principle, those moving micro-clusters reflect some closely moving objects,
naturally leading to high-quality clustering results. The authors of [2] propose incremental algorithms that keep the moving micro-clusters geometrically small, by identifying split events when the bounding rectangles reach some boundary and by using knowledge about collisions between the MMCs (splitting or merging MMCs when those events occur). In experiments conducted on synthetic data, with K-Means as the generic algorithm used for micro-clustering, MMCs showed an improvement in running times compared to NC (normal clustering), though with a slight deterioration in clustering quality. However, MMCs may only help in finding groups that move together for a certain continuous period of time. They are less useful for the task of discovering groups of objects with similar movement patterns, which move together occasionally rather than constantly.
The problem of trajectory clustering is also considered by Nanni et al. [3]. They propose clustering trajectory data using density-based clustering, based on the distance between trajectories. Their OPTICS-based approach uses the reachability distance between points and presents a reachability plot, showing objects ordered by visit time (X) against their reachability measure (Y), which allows users to see the separation into clusters and to decide on a separation threshold. The authors of [3] propose to cluster patterns using a temporal focusing approach in order to improve the quality of the resulting clusters. Some changes to OPTICS are suggested, focusing on the most interesting time intervals instead of examining all intervals, where the interesting intervals are those with the optimal quality of the obtained clusters. A comparison between K-Means, three versions of hierarchical agglomerative clustering, and the trajectory version of OPTICS shows that OPTICS improves purity, with a decrease in completeness.
The SCUBA algorithm of Nehme and Rundensteiner [4] is
proposed for efficient cluster-based processing of large
numbers of spatio-temporal queries on moving objects. The
authors describe an incremental cluster formation technique
that efficiently forms clusters at run-time. Their approach
utilizes two key thresholds, distance and speed. SCUBA
combines motion clustering with shared execution for query
execution optimization. Given a set of moving objects and
queries, SCUBA groups them into moving clusters based on
common spatio-temporal attributes.
To optimize the join execution, SCUBA performs a two-step join execution process: it first pre-filters the set of moving clusters that can produce good results in the join-between-moving-clusters stage, and then proceeds with the individual join-within execution on those selected moving clusters. Experiments show that the performance of SCUBA is better than the traditional grid-based approach, where moving entities are processed individually.
Anagnostopoulos et al. [5] use Minimum Bounding Rectangles (MBRs) for approximating trajectories, defining the distance between two trajectory segmentations s(T_i) and s(T_j) at time t as the distance d between the minimum bounding rectangles at time t, where P(s(T), t) is the projection of the segment of trajectory T at time t on the x axis. Formally:

d(s(T_i), s(T_j), t) = min_{x_i ∈ P(s(T_i), t), x_j ∈ P(s(T_j), t)} d(x_i, x_j)    (1)
Finally, the distance between two segmentations is the sum of the distances between them at every time instant:

d(s(T_i), s(T_j)) = Σ_{t=0}^{m−1} d(s(T_i), s(T_j), t)    (2)

According to [5], the distance between the minimum bounding rectangles is a lower bound of the original distance between the raw data, which is an essential property for guaranteeing the correctness of results for most mining tasks.
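As a concrete reading of Equations (1) and (2), the per-instant distance and its sum can be computed as follows for axis-aligned projections. This is our own sketch, with plain (low, high) interval pairs standing in for the projections P(s(T), t); it is not code from [5].

def interval_distance(a, b):
    """Minimal distance between 1-D intervals a = (low, high) and b = (low, high); 0 if they overlap."""
    return max(0.0, max(a[0], b[0]) - min(a[1], b[1]))

def segmentation_distance(seg_i, seg_j):
    """Equation (2): sum over the time instants t = 0..m-1 of the per-instant distance of Equation (1)."""
    return sum(interval_distance(p_i, p_j) for p_i, p_j in zip(seg_i, seg_j))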
III. AN ALGORITHM FOR BUILDING AN MBB-BASED
TRAJECTORY REPRESENTATION
In this paper, we define a periodic spatio-temporal
trajectory as a series of data-points traversed by a given
moving object during a specific period of time (e.g., one day).
Since we assume that a moving object behaves according to
some periodic spatio-temporal pattern, we have to determine
the duration of each spatio-temporal sequence (trajectory).
Thus, in the experimental part of this work, we assume that a
moving object repeats its trajectories on a daily basis, meaning
that each trajectory describes an object movement during one
day. In a general case, each object should be examined for its
periodic behavior in order to determine the duration of its
periodicity period. The training data window is the period
used to learn the object's periodic behavior based on its
recorded trajectories (e.g., daily trajectories recorded during
one month).
Similar to [5], we represent a trajectory as a list of minimal
bounding boxes. A minimal bounding box (MBB) represents
an interval bounded by limits of time and location. By using
this structure we can summarize close data-points into one
MBB, such that instead of recording the original data-points,
we only need to record the following six elements:
i.xmin = min_{m ∈ i} m.xmin,    i.xmax = max_{m ∈ i} m.xmax
i.ymin = min_{m ∈ i} m.ymin,    i.ymax = max_{m ∈ i} m.ymax
i.tmin = min_{m ∈ i} m.tmin,    i.tmax = max_{m ∈ i} m.tmax    (3)
Where i represents an MBB, m represents a member in a
box, x and y are spatial coordinates, and t is time.
Summarizing a spatio-temporal dataset that records locations
of multiple objects at high frequency (e.g., each second) can
significantly reduce running times of data mining algorithms.
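To make the representation concrete, the six bounds of Equation (3), together with a counter of the summarized data-points (used below as the data# property), can be kept in a small structure that is updated in constant time per inserted point. The following Python sketch is our own illustration, not part of the original system; the field names simply mirror the notation used in the text.

from dataclasses import dataclass

@dataclass
class MBB:
    """Minimal bounding box summarizing a set of (x, y, t) data-points."""
    xmin: float
    xmax: float
    ymin: float
    ymax: float
    tmin: float
    tmax: float
    data_count: int = 1   # amount of summarized data-points (the data# property)

    @classmethod
    def from_point(cls, x, y, t):
        # on the first arrival the minimum and the maximum equal the data-point values
        return cls(x, x, y, y, t, t)

    def add_point(self, x, y, t):
        # Equation (3): each bound is the min/max over all members of the box
        self.xmin, self.xmax = min(self.xmin, x), max(self.xmax, x)
        self.ymin, self.ymax = min(self.ymin, y), max(self.ymax, y)
        self.tmin, self.tmax = min(self.tmin, t), max(self.tmax, t)
        self.data_count += 1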
We have developed a new algorithm for summarizing a trajectory and setting the MBB bounds.
In this algorithm, the bounds of a given MBB are extended until the time or the space distance between the subsequent data-point and the maximal bounds of that MBB reaches some pre-defined segmentation thresholds. When a threshold is exceeded in at least one of the dimensions, a new minimal bounding box is created, with the time of the subsequent data-point as its minimal time bound.
The larger the threshold is, the more summarized the
trajectories are, meaning that we increase the efficiency of the
next mining stages (shorter running times) while potentially
decreasing their accuracy.
We chose to also set a time threshold in order to limit the time range of a given MBB. Without this threshold, an MBB could span a long period of time even though it may contain a very limited amount of location data on a given object.
Notice that there are two cases for summarizing
(segmenting) trajectories:
1. Summarizing raw data collected during a single period
(e.g. one day) into one segmentation.
2. Summarizing raw data collected during multiple
periods (e.g. 30 days) into one segmentation.
Both cases are summarized in the same manner. The second
option leads to coarser summarization since more data is
summarized into a single segmentation.
We present an enhanced incremental algorithm for
representing an object trajectory as a set of MBBs from a
spatio-temporal dataset D covering object movements during a
predefined period (e.g., 24 hours).
The algorithm processes each data point in the data stream and inserts it into an existing MBB as long as its distance from the MBB bounds is within the thresholds defined as the algorithm parameters; otherwise, it creates a new MBB. In our previous work [6] we analyzed the selection of the threshold and its effect on the summarization resolution.
The "lastMBB" function returns the MBB with the maximal
(latest) time bounds in the trajectory, the "addMBB" function
initializes a new MBB in the trajectory with bounds and
properties updated by the first incoming data-point (on the
first arrival, the minimum and the maximum are equal to the
data-point values), and the "addPoint" function updates an
existing MBB's properties (bounds and data amount) according to the inserted data-point.
As opposed to earlier MBB-based representations, we add other properties to the standard MBB-based representation that improve our ability to
perform operations on the summarized data, like measuring
similarity between trajectories or discovering exceptional data
points. The additional properties are calculated by aggregation
methods:
i.p = aggregation_{m ∈ i}(m.state)    (4)
Where p stands for the value of a property variable in a
minimal bounding box i, m represents a member in a box, and
state is the data-point property that is being aggregated. In our
algorithm, p represents the number of data points (data#) that
are summarized by a given MBB:
i.data# = count_{m ∈ i}(1)    (5)
Therefore when segmenting the original trajectories into
MBBs we also count the amount of data points summarized
by each MBB.
As can be seen from the time bounds in Equation (3), we
use MBBs having irregular, rather than constant, time
intervals. Constant time intervals may facilitate processing
operations like measuring similarity between trajectories, but
at the price of setting forced bounds. Blurring the real data
bounds may cause the separation of close data points and the
union of distant data points. Since we would like the
preprocessing stage to maintain as much precision as possible,
we chose to use irregular time bounds.
A periodic (daily) trajectory of an object is identified by an
object ID O and a date D, and it can be stored as a list of n
MBBs:
{O, D, [t1m, t1M, x1m, x1M, y1m, y1M, N1],
       [t2m, t2M, x2m, x2M, y2m, y2M, N2], …,
       [tnm, tnM, xnm, xnM, ynm, ynM, Nn]}    (6)
where t represents time, x and y represent coordinates, m is
used for minimal and M for maximal, and N represents the
amount of data points belonging to each MBB. Figure 1
demonstrates an object's trajectory and its MBB-based
representation for a given period.
Input: a spatio-temporal dataset (D) that describes an object's movements along a period of time; thresholds on the x and y distances and on the time duration of an MBB.
Output: a new trajectory (T)
Building an object's trajectory:
  T.addMBB(D[1])                                    -- the first data-point initializes the first MBB
  For each item in D                                -- except for the first item
    if |item.X − T.lastMBB.maxX| < XdistThreshold
       and |item.Y − T.lastMBB.maxY| < YdistThreshold
       and |item.T − T.lastMBB.maxT| < durationThreshold
      T.lastMBB.addPoint(item)                      -- insert into the current MBB
    else
      T.addMBB(item)                                -- create a new MBB when a threshold is exceeded
Fig. 1. An object's trajectory and its MBB-based representation.
Incoming data-points update the MBB-based representation in the order of their arrival times. Therefore, the minimal time bound of the first MBB is the time of the earliest data-point in the trajectory, and the maximal time and space bounds of the current MBB keep growing as data-points are inserted, as described above.
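The incremental construction described above can be sketched in Python as follows; this is our own illustrative version, reusing the MBB class sketched earlier in this section, with placeholder threshold names rather than the authors' actual parameter values.

def build_trajectory(points, x_thr, y_thr, t_thr):
    """Summarize a time-ordered stream of (x, y, t) points into a list of MBBs."""
    trajectory = []
    for x, y, t in points:
        if not trajectory:
            trajectory.append(MBB.from_point(x, y, t))   # the first point opens the first MBB
            continue
        last = trajectory[-1]
        if (abs(x - last.xmax) < x_thr and
                abs(y - last.ymax) < y_thr and
                abs(t - last.tmax) < t_thr):
            last.add_point(x, y, t)                      # insert into the current MBB
        else:
            trajectory.append(MBB.from_point(x, y, t))   # thresholds exceeded: open a new MBB
    return trajectory

Larger thresholds produce fewer, coarser MBBs, which is exactly the efficiency/accuracy trade-off discussed above.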
For example, when summarizing the cars dataset, we read the data describing each car separately, in the order of its sampling times. We summarize the location records into a single MBB as long as they are close enough to each other (i.e., they do not exceed the pre-determined distance thresholds from the MBB's current bounds). When a record is far enough from the earlier records, we start summarizing it into a newly created MBB.
The computational time complexity of processing n data-points is O(n). At the end of this preprocessing stage we obtain a set of non-overlapping MBBs that we refer to as a "trajectory segmentation".
Fig. 3. A. Times of overlap between two MBBs; B. Minimal distance between two MBBs.
IV. SIMILARITY BETWEEN TRAJECTORIES

In the following section we describe algorithms for clustering spatio-temporal items, where our items are mobile objects. In order to cluster such items we have to define a similarity measure that enables the discovery of similar objects.

We define the similarity between two trajectories as the sum of the similarities between their time-overlapping MBBs, divided by the product of the numbers of MBBs in the two compared trajectories (the trajectories may differ in their numbers of MBBs). The two compared trajectories are illustrated in Figure 2.

Fig. 2. Measuring similarity between two trajectories. Arrows represent minimal distances between two overlapping MBBs.

We suggest a new measure for the similarity between two overlapping MBBs, based on the similarity between segments described in [5]. If we treat each MBB as a segment, we can use the following formula, where tm is the time when the two MBBs start to overlap, tn is the time when their overlap ends, and rangeT is the possible range of times for all mobile objects in the training window:

sim(MBB(T_i), MBB(T_j)) = minDsim(MBB(T_i), MBB(T_j)) · |t_m − t_n| / rangeT    (7)

The minimal distance and the times tm and tn are illustrated in Figure 3. We define the minimal-distance similarity (minDsim) between two MBBs as:

minDsim(MBB(T_i), MBB(T_j)) = XminDsim(MBB(T_i), MBB(T_j)) · YminDsim(MBB(T_i), MBB(T_j))    (8)

where the minimal-distance similarity in the x and y dimensions, respectively, is based on the minimal distance between the two MBBs in that dimension (normalized by the coordinate range), and equals one if the two MBBs overlap in that dimension:

XminDsim(MBB(T_i), MBB(T_j)) = 1 − max(0, max(MBB(T_i).Xmin, MBB(T_j).Xmin) − min(MBB(T_i).Xmax, MBB(T_j).Xmax)) / rangeX    (9)

YminDsim(MBB(T_i), MBB(T_j)) = 1 − max(0, max(MBB(T_i).Ymin, MBB(T_j).Ymin) − min(MBB(T_i).Ymax, MBB(T_j).Ymax)) / rangeY    (10)

rangeX and rangeY are the possible ranges of the X and Y coordinates, respectively, for all mobile objects in the training window.

Using the enhanced representation of trajectories we can improve this similarity measure as follows. We multiply the minimal-distance measure (7) by the similarity between the amounts of data points in the two compared MBBs (data#sim). Since each MBB summarizes some data points, the more data points are included in both of the compared MBBs, the stronger the support we have for their similarity. Our "data-amount-based" similarity is calculated as:

sim(MBB(T_i), MBB(T_j)) = minDsim(MBB(T_i), MBB(T_j)) · |t_m − t_n| / rangeT · data#sim(MBB(T_i), MBB(T_j))    (11)

where the similarity between the amounts of data points that are summarized within the two MBBs is:

data#sim(MBB(T_i), MBB(T_j)) = min(MBB(T_i).data#, MBB(T_j).data#)    (12)
The "data-amount-based" similarity measure for measuring
similarity between two MBBs has time and space complexities
of O 1 , and it is computed as follows:
Input: two MBBs (M1,M2), possible ranges of x,y,t and data-amount (for
normalization)
Output: the MBBs similarity measure
MBB-similarity:
simX 1 max(0, (max(M 1.xmin , M 2 .xmin ) min(M 1.xmax , M 2 .xmax )) / xRange)
simY
1 max( 0, (max( M 1 . y min , M 2 . y min ) min( M 1 . y max , M 2 . y max )) / yRange )
1 max(0, (min( M 1 .t max , M 2 .t max ) max(M 1.t min , M 2 .t min )) / tRange)
sim # min(M1.data #, M 2 .data #)
return ((simX + simY) /2 * SimT * sim#)
simT
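In Python, the same measure can be sketched as follows. This is our own illustration operating on the MBB structure sketched in Section III; it follows Equations (9), (10) and (12) and the time-overlap term of Equation (11), and keeps the product form of Equation (8) (whereas the pseudocode above averages simX and simY).

def mbb_similarity(m1, m2, range_x, range_y, range_t):
    """Data-amount-based similarity between two time-overlapping MBBs."""
    # Eq. (9)/(10): one minus the normalized gap between the boxes in each spatial dimension
    sim_x = 1.0 - max(0.0, (max(m1.xmin, m2.xmin) - min(m1.xmax, m2.xmax)) / range_x)
    sim_y = 1.0 - max(0.0, (max(m1.ymin, m2.ymin) - min(m1.ymax, m2.ymax)) / range_y)
    min_d_sim = sim_x * sim_y                                    # Eq. (8)
    # time-overlap term |tm - tn| / rangeT of Eq. (11)
    overlap = max(0.0, min(m1.tmax, m2.tmax) - max(m1.tmin, m2.tmin))
    data_sim = min(m1.data_count, m2.data_count)                 # Eq. (12)
    return min_d_sim * (overlap / range_t) * data_sim            # Eq. (11)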
An algorithm for measuring the similarity between two trajectory segmentations has O(m) time complexity in the worst case (and O(1) space complexity), where m is the maximal number of MBBs in each of the two compared trajectories. The algorithm is described below:
Input: two trajectories (T1, T2)
Output: the similarity measure between the trajectories
Trajectories-similarity:
  MBB1 ← 0, MBB2 ← 0, similarity ← 0        -- indices of the currently compared MBBs
  while (MBB1 < T1.MBB# and MBB2 < T2.MBB#)
    while (T1(MBB1).maxT < T2(MBB2).minT)
      MBB1++                                -- while MBB1 ends before MBB2 starts: move to the next MBB in T1
    while (T2(MBB2).maxT < T1(MBB1).minT)
      MBB2++                                -- while MBB2 ends before MBB1 starts: move to the next MBB in T2
    similarity += MBB-similarity(T1(MBB1), T2(MBB2))
    if (T1(MBB1).maxT < T2(MBB2).maxT)
      MBB1++
    else MBB2++                             -- proceed to the next MBB after an overlap
  return similarity / T1.MBB# / T2.MBB#
where T1 and T2 are the two compared trajectories, MBB1 and
MBB2 are the locations of the two currently compared MBBs
in T1 and T2 respectively, the MBB# attribute returns the
amount of MBBs in a given trajectory, and the MBB-similarity
function returns the similarity between the two given MBBs
according to the previous pseudo code.
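A matching Python sketch of the trajectory-level measure (again our own illustration, assuming the mbb_similarity function above and trajectories given as time-ordered lists of MBBs):

def trajectories_similarity(t1, t2, range_x, range_y, range_t):
    """Sum the similarities of time-overlapping MBB pairs, normalized by the product of the MBB counts."""
    i = j = 0
    total = 0.0
    while i < len(t1) and j < len(t2):
        if t1[i].tmax < t2[j].tmin:        # MBB i ends before MBB j starts
            i += 1
        elif t2[j].tmax < t1[i].tmin:      # MBB j ends before MBB i starts
            j += 1
        else:                              # the two MBBs overlap in time
            total += mbb_similarity(t1[i], t2[j], range_x, range_y, range_t)
            if t1[i].tmax < t2[j].tmax:    # advance the MBB that ends first
                i += 1
            else:
                j += 1
    return total / (len(t1) * len(t2))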
The similarity between two objects where each object has
one periodic movement pattern can be calculated according to
the previous algorithm (similarity between two trajectory
segmentations).
In this paper we assume that each mobile object is
represented by a single segmentation which is a
summarization of its trajectories, or in other words, its
movement pattern.
In the general case, where each object may have more than one movement pattern, the similarity between two objects is calculated as the sum of the similarities between every pair of movement patterns (segmentations), divided by the product of the numbers of movement patterns of the two objects (the number of comparisons). The algorithm is described below:

Input: two objects (O1, O2), each having several movement patterns
Output: the similarity measure between the objects
Objects-similarity:
  similarity ← 0
  T1 ← 0                                  -- index of the currently compared trajectory of O1
  while (T1 < O1.trajectories#)           -- for each trajectory of O1
    T2 ← 0                                -- index of the currently compared trajectory of O2
    while (T2 < O2.trajectories#)         -- for each trajectory of O2
      similarity += Trajectories-similarity(O1(T1), O2(T2))
      T2++                                -- proceed to the next trajectory of O2
    T1++                                  -- proceed to the next trajectory of O1
  return similarity / O1.trajectories# / O2.trajectories#

where O1 and O2 are the two compared objects, T1 and T2 are the two currently compared movement patterns (segmentations) of O1 and O2 respectively, the Trajectories-similarity function returns the similarity of the two compared segmentations according to the previous pseudo code, and the trajectories# attribute returns the number of trajectories that belong to a given object.

The algorithm's worst-case time complexity is O(tm) (with O(1) space complexity), where t is the maximal number of trajectories of the compared objects O1 and O2, and m is the maximal number of MBBs per trajectory.

For the task of clustering multi-pattern objects according to the similarity between their trajectories, as presented in the following sections, we need to compute the similarity between an object and a cluster centroid. Since our centroid is represented as a segmentation, as also defined in the following sections, we compare two objects where one may have several segmentations and the other (the centroid) is represented by only one segmentation. This reduces the worst-case time complexity to O(tm), where t is the number of segmentations that belong to the object and m is the maximal number of MBBs in a segmentation.
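A minimal Python sketch of this object-level measure, assuming the trajectories_similarity function above and objects given as lists of segmentations:

def objects_similarity(obj1, obj2, range_x, range_y, range_t):
    """obj1 and obj2 are lists of segmentations (each segmentation is a list of MBBs)."""
    total = sum(
        trajectories_similarity(seg1, seg2, range_x, range_y, range_t)
        for seg1 in obj1
        for seg2 in obj2
    )
    return total / (len(obj1) * len(obj2))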
V. REPRESENTING A CLUSTER'S CENTROID
We represent a cluster centroid as a segmentation, or in
other words as a vector of MBBs. Each MBB is an interval
that holds information about the upper and lower bounds in
each one of the d-dimensions (in our case 2-D) of the location,
lower and upper time bounds, and the amount of data-points
that are summarized within the MBB.
As opposed to a movement pattern, which is a set of non-overlapping MBBs, a centroid is a set of MBBs that can overlap in cases where several locations (trajectories) are allowed during the same time interval.
During the clustering of spatio-temporal items (trajectories
of mobile objects) several similar items are inserted into each
cluster. First, the cluster's centroid is initialized to the first
inserted item, and after all items are clustered, the cluster
centroids need to be updated.
We defined a new representation for a centroid as a vector
of intervals, instead of the common representation of a
centroid as a vector of numeric values. The common updating
method calculates a cluster centroid as a vector of averages,
where the average is calculated according to the values of the
vectors that belong to the cluster. We cannot use this
averaging technique with our suggested representation of
cluster centroids, since averaging interval bounds will lead to
an invalid interval, with bounds that are not the real limits of
the interval. Therefore we define a new algorithm for updating
the cluster centroids using a bounding technique instead of
averaging.
We first sort all the MBBs that belong to items in a cluster.
MBBs with larger intervals will appear first (sorted by time
dimension, then by X dimension and finally by Y dimension).
Then, each MBB is added into the cluster's centroid (represented as a segmentation) in the following manner:
1. If the inserted MBB is within the scope of an existing centroid MBB, it only updates the amount of summarized data points in that MBB.
2. If the inserted MBB exceeds the scope of an existing centroid MBB within an allowed distance (less than the pre-determined thresholds), it updates the amount of summarized data points in that MBB and also extends its exceeded bounds.
3. Otherwise, the inserted MBB is added to the centroid as a new MBB.
The centroid updating algorithm is as follows:
Input: previous centroids, current clusters
Output: new centroids
Update-centroids:
  For each c in current-clusters
    c.sortByArea()
    For each MBB in c
      if new-centroids[c.id].empty()                 -- if the centroid is still empty, insert the first MBB
        then new-centroids[c.id].add(MBB)
      else tempMBB ← new-centroids[c.id].firstMBB()  -- otherwise, scan the centroid's MBBs
        while (not MBB.withinMBB(tempMBB))           -- can the current centroid MBB contain the new MBB?
          tempMBB ← new-centroids[c.id].nextMBB()    -- if not: check the next MBB in the centroid
          if is-last(tempMBB)                        -- after checking all centroid MBBs
            then new-centroids[c.id].addMBB(MBB)     -- insert the MBB as a new centroid MBB
            break while
        if MBB.withinMBB(tempMBB)                    -- found a fully bounding MBB
          then tempMBB.data# ← tempMBB.data# + MBB.data#
        else tempMBB.updateMBB(MBB)                  -- found a partly bounding MBB: extend its bounds

where c is the currently updated cluster, id returns the cluster number, the function empty returns true if the centroid of the cluster is empty, the function add adds an MBB to the updated centroid in new-centroids, and the function withinMBB returns true if a given MBB is inside the bounds of another given MBB. tempMBB holds the MBB that is currently being checked for containing another MBB, the data# attribute holds the amount of summarized data-points within the given MBB, the function updateMBB updates the bounds and properties (data#) of a given MBB according to a given inserted MBB, and the function is-last returns true if there is no subsequent MBB in the given centroid.
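A rough Python sketch of this bounding-based update, under our own reading of the pseudocode above and reusing the MBB class from Section III; for brevity it handles only the fully-bounded and the new-MBB cases, omitting the partial-containment case that extends an existing MBB's bounds.

def contains(outer, inner):
    """True if `inner` lies completely within the bounds of `outer`."""
    return (outer.xmin <= inner.xmin and inner.xmax <= outer.xmax and
            outer.ymin <= inner.ymin and inner.ymax <= outer.ymax and
            outer.tmin <= inner.tmin and inner.tmax <= outer.tmax)

def update_centroid(centroid, cluster_mbbs):
    """Fold a cluster's MBBs (sorted largest-first) into its centroid segmentation."""
    for mbb in sorted(cluster_mbbs,
                      key=lambda m: (m.tmax - m.tmin, m.xmax - m.xmin, m.ymax - m.ymin),
                      reverse=True):
        for c in centroid:
            if contains(c, mbb):
                c.data_count += mbb.data_count   # fully bounded: only update the data amount
                break
        else:
            centroid.append(mbb)                 # no bounding MBB found: add as a new centroid MBB
    return centroid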
VI. CHOOSING A CLUSTERING ALGORITHM

In order to perform clustering, we had to choose one clustering technique among the wide variety of existing algorithms. According to Jain et al. [1], the partitioning K-Means algorithm has been used to cluster large data sets. The reasons behind the popularity of the K-Means algorithm are mainly its relatively low time and space complexities. The K-Means time complexity is O(nkl), where n is the number of data-points, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Typically, k and l are fixed in advance, and so the algorithm has linear time complexity in the size of the data set. K-Means is also order-independent: for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm. However, the K-Means algorithm is sensitive to the initial seed selection and, even in the best case, it can produce only hyper-spherical clusters.

Hierarchical algorithms are more versatile, but they have some disadvantages: their time complexity is O(n² log n) and their space complexity is O(n²), because a similarity matrix of size n × n has to be stored.

We chose to use the K-Means algorithm mainly for its reduced time and space complexities. The decision was reinforced by the high cost of similarity computations in the spatio-temporal environment.

Segmentations are a summarized version of the original spatio-temporal data. In the task of clustering trajectories (or objects with a single trajectory), we can cluster t segmentations (or objects), each containing up to m MBBs, with a time complexity of O(mtkl), where k is the number of
clusters, and l is the number of iterations taken by the algorithm to converge. The algorithm's space complexity is O(k + mt), where each of the t trajectories requires space for up to m MBBs.

In the task of clustering objects where each object has several segmentations, we can cluster s objects (each having up to t segmentations with up to m MBBs) with a time complexity of O(tmskl) and a space complexity of O(k + mts).
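Putting the pieces together, the clustering stage itself can be sketched as a K-Means-style loop that alternates between assigning objects to their most similar centroid (using the objects-similarity measure) and rebuilding each centroid with the bounding-based update. The skeleton below is our own illustration, with k, the iteration count and the initialization strategy chosen arbitrarily.

import random

def cluster_objects(objects, k, iterations, ranges):
    """objects: list of segmentations (one per object); ranges: (range_x, range_y, range_t)."""
    # initialize each centroid from a randomly chosen object's segmentation
    centroids = [list(random.choice(objects)) for _ in range(k)]
    assignment = [0] * len(objects)
    for _ in range(iterations):
        # assignment step: attach every object to its most similar centroid
        for idx, obj in enumerate(objects):
            sims = [objects_similarity([obj], [c], *ranges) for c in centroids]
            assignment[idx] = max(range(k), key=lambda j: sims[j])
        # update step: rebuild each centroid from the MBBs of its member objects
        for c in range(k):
            members = [mbb for idx, obj in enumerate(objects)
                       if assignment[idx] == c for mbb in obj]
            if members:
                centroids[c] = update_centroid([], members)
    return assignment, centroids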
VII. CLUSTERING OBJECTS IN ORDER TO RECOGNIZE REGULAR
GROUPS OF OBJECTS
Object clusters contain objects that have similar trajectories
or movement patterns. Objects in the same cluster are
frequently (in most of the time intervals) close in space. An
object cluster represents a group of moving objects that use similar trajectories but do not necessarily move together along the same trajectory all the time.
In order to recognize groups of mobile objects that are often close in space to each other, we can operate in one of two ways.

First, we can cluster mobile objects according to the similarity between their trajectories in order to find groups of objects with similar trajectories. We first summarize the raw data using our suggested algorithm, as described in Section III. Each object's data-points (within the training window) are represented as one segmentation, assuming one movement pattern per object. Then we cluster the objects according to their summarized trajectories (segmentations). As explained in Section VI, this requires O(tmkl) time complexity when clustering t objects, each containing one summarized trajectory with up to m MBBs, where k is the number of clusters and l is the number of iterations taken by the algorithm to converge.

Second, we can cluster the objects after abnormal behaviors have been removed from the movement patterns or the summarized trajectories. By removing exceptions we significantly decrease the size of the input for the objects clustering algorithm, and therefore improve its efficiency, but at the cost of removing data-points from the input and thus risking a reduction in correctness.

We can detect an exceptional MBB by its "data amount" property, which records the amount of data points summarized within an MBB during the training window. If an object is frequently found in a given location at a given time, the "data amount" of the corresponding MBB will be large, but if the object rarely reaches this location at this time, the "data amount" of the MBB will be low. We can use this property for recognizing sparse or exceptional MBBs.

VIII. EXPERIMENTS

A. Evaluation methods

1) Dunn index
The Dunn index measures the overall worst-case compactness and separation of a clustering, with higher values being better [7]:

Dunn = D_min / D_max    (13)

where D_min is the minimum distance between any two objects in different clusters (separation) and D_max is the maximum distance between any two items in the same cluster (homogeneity). The numerator captures the worst-case separation between clusters, while the denominator captures the worst-case compactness of the clusters.

2) Rand index
Since clustering is an unsupervised machine learning technique, there is no set of correct answers that can be compared to the clustering results. We can, though, generate data sets by a mechanism that helps build the correct partition into groups, and then test the clustering results using the Rand index, which measures clustering accuracy compared to the ground truth. This index compares all pairs of objects in the data set after clustering. An "agreement" (A) is a pair that is together, or not together, in both the real and the measured clusters; a "disagreement" (D) is the opposite case. The Rand index is computed as [8]:

R = A / (A + D)    (14)
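Both validity measures are straightforward to compute once a distance function between objects and the ground-truth labels are available; a small sketch under those assumptions:

from itertools import combinations

def dunn_index(items, labels, dist):
    """Eq. (13): min inter-cluster distance divided by max intra-cluster distance."""
    inter = min(dist(a, b) for (a, la), (b, lb) in combinations(list(zip(items, labels)), 2) if la != lb)
    intra = max(dist(a, b) for (a, la), (b, lb) in combinations(list(zip(items, labels)), 2) if la == lb)
    return inter / intra

def rand_index(true_labels, found_labels):
    """Eq. (14): fraction of object pairs on which the two clusterings agree."""
    agree = disagree = 0
    for (t1, f1), (t2, f2) in combinations(list(zip(true_labels, found_labels)), 2):
        if (t1 == t2) == (f1 == f2):
            agree += 1
        else:
            disagree += 1
    return agree / (agree + disagree)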
B. Obtaining spatio-temporal data

1) The INFATI dataset
We found only one available real spatio-temporal dataset that fits our purposes, mainly due to privacy issues. The INFATI dataset we used [9] is a real dataset that contains information about a group of cars and their locations during a period of three weeks. The INFATI data contains GPS log data from 11 cars, collected during December 2000 and January 2001. All cars were driving in the municipality of Aalborg, which includes the city of Aalborg, its suburbs, and some neighboring towns. The collected data encompasses a range of 2017 KM × 3263 KM (x × y). For more than a month, the movement of each car was registered in the car's database. When a car was moving, its GPS position was sampled every second. The GPS positions were stored in the Universal Transverse Mercator (UTM 32) format. No sampling was performed when a car was parked.

2) The TRAJECTORIES dataset
In order to better control the data and to allow the evaluation of clustering results by comparison with a predefined ground truth, we generated the synthetic TRAJECTORIES dataset. By setting the movement formulas ourselves, we could design the trajectories and make sure that objects belong to their intended group and that they use similar trajectories along several time periods. We used movement formulas of the form:

t_1 = t_0 + rate + noise
x_1 = x_0 + t_1 · v_x + noise
y_1 = y_0 + t_1 · v_y + noise    (15)

where x_0, y_0 are the previous coordinates, x_1, y_1 are the current coordinates, t_0 is the time when the previous data point was sampled, and t_1 is the time when the current point is sampled. v_x and v_y are the velocities of the movement along the x and y axes (which change along the movement), and rate is the time between samplings. The data is asynchronous. noise is a number randomly chosen from a range defined as 15 percent of the data range in the corresponding dimension.

The first 13 objects were tracked along 45 days (between day 1 and day 45) and the next 10 objects were tracked along 25 days (between day 1 and day 25). Each object was sampled at least 35 times during each day, and between 950 and 1050
samples in total.
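A generator in the spirit of the movement formulas (15) can be sketched as follows; it is our own approximation, with the noise ranges, velocities and coordinate ranges chosen arbitrarily for illustration rather than taken from the paper.

import random

def generate_object(days, samples_per_day, vx, vy, rate,
                    noise_frac=0.15, x_range=1000.0, y_range=1000.0):
    """Generate daily (x, y, t) samples roughly following the movement formulas of Eq. (15)."""
    daily_trajectories = []
    for _ in range(days):
        x, y, t = 0.0, 0.0, 0.0
        samples = []
        for _ in range(samples_per_day):
            t = t + rate + random.uniform(-noise_frac, noise_frac) * rate       # t1 = t0 + rate + noise
            x = x + t * vx + random.uniform(-noise_frac, noise_frac) * x_range  # x1 = x0 + t1*vx + noise
            y = y + t * vy + random.uniform(-noise_frac, noise_frac) * y_range  # y1 = y0 + t1*vy + noise
            samples.append((x, y, t))
        daily_trajectories.append(samples)
    return daily_trajectories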
The proposed algorithm is evaluated on these two datasets as described in the following sections.

C. Evaluation experiments

For the evaluation of our algorithm for finding regular groups (using our enhanced spatio-temporal clustering algorithm), we clustered mobile objects according to their trajectories (assuming one movement pattern per object), using both the real INFATI dataset and the synthetic TRAJECTORIES dataset. We first summarized each mobile object's data-points along the training window into one segmentation, then removed exceptions from the summarized trajectories, and finally clustered the objects according to the similarity between their trajectories, using our "data-amount-based" similarity measure.

Simulations on the INFATI data were set up as described in Table I. The dataset holds information on the locations of 11 cars along a training window of two months.
TABLE I
PARAMETERIZATION OF THE INFATI DATASET
  segmentation bound (for trajectory segmentation and for centroid segmentation): stdev(D) · (max(D) − min(D)) · 0.2
  K-Means iterations amount (i): 20
  K-Means clusters amount (k): 2, 3, 4, 5, 6, 7
  exceptions bound: k · 10
Notice that tuning the k parameter is beyond the scope of this paper; existing methods [10] try to estimate the right number of groups in order to optimize k.

The synthetic TRAJECTORIES dataset was used for testing the algorithms for discovering regular groups. In order to enable measuring the clusters' correctness, we created the objects' 45 trajectories with a clear distribution into three groups with similar movement patterns: (5, 10, 15, 20, 25, 30, 35, 40, 45), (1, 2, 4, 6, 7, 9, 11, 12, 14, 16, 17, 19, 21, 22, 24, 26, 27, 29, 31, 32, 34, 36, 37, 39, 41, 42, 44), and (3, 8, 13, 18, 23, 28, 33, 38, 43), as can be seen in Figure 4.

Fig. 4. The TRAJECTORIES dataset in a 3-D visualization.

Simulations on the TRAJECTORIES data were set up as described in Table II. The dataset holds information on the locations of 45 mobile objects along a training window of one day. Each object belongs to one of three groups according to its movement pattern, and to one of five groups according to its abnormal behavior.
TABLE II
PARAMETERIZATION OF THE TRAJECTORIES DATASET
  segmentation bound (for trajectory segmentation and for centroid segmentation): stdev(D) · (max(D) − min(D)) · 0.4
  K-Means iterations amount (i): 30
  K-Means clusters amount (k): 2, 3, 4, 5, 6, 7, 8
  exceptions bound: k · 10
Using the groups recognized by visualization techniques as ground truth led to the following clustering results in the task of clustering the INFATI dataset's objects according to the similarity between their movement patterns: a 46.8% average Rand index, an average Dunn index of 0.99, and an average run time of 51 seconds.

The correctness of the results according to the Rand index is relatively low, mainly because our ground truth depends on decisions made by eyesight, which can be misleading. Our clustering algorithm may have found more accurate groups than the "ground truth" groups, which were found by visualization techniques and may themselves be inaccurate.

This problem is avoided when using the synthetic TRAJECTORIES dataset, which was created with a clear distribution of its 45 moving objects (each having one movement pattern) into three groups with similar movement patterns.

Simulations on the synthetic TRAJECTORIES dataset led to the following clustering results when clustering objects according to the similarity between their movement patterns: an 87.2% average Rand index, an average Dunn index of 0.94, and an average run time of 9 seconds. As we expected, the correctness according to the Rand index is much higher.

IX. CONCLUSION

In this work we presented a new way of summarizing periodic spatio-temporal data, including a new similarity measure between summarized trajectories. Our proposed incremental algorithm for clustering spatio-temporal items was shown to work well: we obtained high cluster-validity results when clustering regular groups using our enhanced clustering algorithm.
REFERENCES
[1] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys (CSUR), vol. 31, no. 3, 1999, pp. 264-323.
[2] X. Li, J. Han, and S. Kim, "Motion-Alert: Automatic Anomaly Detection in Massive Moving Objects," ISI 2006, pp. 166-177.
[3] M. Nanni and D. Pedreschi, "Time-focused density-based clustering of trajectories of moving objects," JIIS, Special Issue on Mining Spatio-temporal Data, vol. 27, no. 3, 2006, pp. 267-289.
[4] R. Nehme and E. A. Rundensteiner, "SCUBA: Scalable Cluster-Based Algorithm for Evaluating Continuous Spatio-Temporal Queries on Moving Objects," EDBT 2006, pp. 1001-1019.
[5] A. Anagnostopoulos, M. Vlachos, M. Hadjieleftheriou, E. Keogh, and P. S. Yu, "Global Distance-Based Segmentation of Trajectories," KDD 2006, pp. 34-43.
[6] S. Elnekave, M. Last, and O. Maimon, "A Compact Representation of Spatio-Temporal Data," to appear in ICDM Workshop on Spatial and Spatio-Temporal Data Mining (SSTDM 2007), IEEE, 2007.
[7] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," J. Cybern., vol. 3, 1973, pp. 32-57.
[8] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, 1971, pp. 846-850.
[9] C. S. Jensen, H. Lahrmann, S. Pakalnis, and J. Runge, "The INFATI Data," Aalborg Univ., TimeCenter Technical Report, 2004, pp. 1-10. Available: http://www.cits.aau.dk/download/INFATI.pdf
[10] D. Pelleg and A. Moore, "X-means: Extending K-means with efficient estimation of the number of clusters," Proc. 17th International Conference on Machine Learning, 2000, pp. 727-734.