Download Time-focused density-based clustering of trajectories of moving

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Human genetic clustering wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

K-means clustering wikipedia , lookup

Nearest-neighbor chain algorithm wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
Time-focused density-based
clustering of trajectories
of moving objects
Margherita D’Auria
Mirco Nanni
Dino Pedreschi
Plan of the talk
Introduction
Density-based clustering on trajectories
Trajectory data model distance measure
Results
Temporal Focusing
Motivations
Problem & context
Density-based Clustering (OPTICS)
A clustering quality measure
Heuristics for optimal temporal interval
Conclusions & future work
2
Motivations
Plenty of actual and future data sources for
spatio-temporal data
Sophisticated analysis method are required, in
order to fully exploit them
Data
mining methods
Which kind of patterns/models?
Main objectives
A
better understanding of the application domain
An improvement for private and public services
3
Problem & context
A distinguishing case: Mobile devices
PDAs
Mobile phones
LBS-enabled devices (may include the two above)
They (can) yield traces of their movement
An important problem:
Discovering groups of individuals that (approx.) move together in some
period of time
E.g.: detection of traffic jams during rush hours
A candidate Data Mining reformulation of the problem
Clustering of individuals’ trajectories
4
Which kind of clustering?
Several alternatives are available
General requirements:
Non-spherical
clusters should be allowed
E.g.: A traffic jam along a road
It should be represented as a cluster which individuals form a
“snake-shaped” cluster
Tolerance
to noise
Low computational cost
Applicability to complex, possibly non-vectorial data
A suitable candidate: Density-based clustering
In
particular, we adopt OPTICS
5
A crushed intro to OPTICS
A density threshold is defined through two parameters:
ε: A neighborhood radius
MinPts: Minimum number of points
Key concepts:
Core objects
Reachability-distance reach-d( p, q )
Objects with a ε-Neighborhood that contains at least MinPts objects
(simplified definition:) Distance between objects p and q
Example:
Object “q” is a core object if MinPts=2
Object “p” is not
Their reach-d() is shown
ε
qq
reach-d(p,q)
ch
pp
ε –neighborhood of q
6
A crushed intro to OPTICS
The algorithm:
1.
Repeatedly choose a non-visited random object, until a core object
is selected
2.
Select the core object having the smallest reachability distance
from all the visited core objects. If none can be found, go to step 1
Output: reach-d() of all visited points
(reachability plot)
Order of visit
0.22
18
13
9
14
0.2
12
0.16
15
11
Y axis
3
0.14
10
1
16
2
17
0
4
5
“jump” from left- h
and group (0
- 9)
to right- hand one (10
- 18)
0.18
Reachability
threshold
0.12
0.1
Cluster 1
0.08
6
Cluster 2
0.06
8
7
X axis
0.04
0.02
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
7
Applying OPTICS to trajectories
Two key issues have to be solved
A
A
suitable representation for trajectories is needed
Which data model for trajectories?
mean for comparing trajectories has to be provided
Which distance between objects?
OPTICS needs to define one to perform range queries
8
A trajectory data model
Raw input data:
Each trajectory is represented as a set of time-stamped coordinates
T=(t1,x1,y1), …, (tn, xn, yn) => Object position at time ti was (xi,yi)
Data model
Parametric-spaghetti: linear interpolation between consecutive points
9
A distance between trajectories
Adopted distance = average distance
D(τ 1 ,τ 2 ) |T
d (τ (t ),τ
∫
=
T
1
2
(t ))dt
|T |
It is a metric => efficient indexing methos allowed
10
A sample dataset
Set of trajectories forming 4 clusters + noise
Generated by the CENTRE system (KDDLab software)
11
OPTICS vs.
HAC & K-means
K-means
HAC-average
OPTICS
12
Temporal focusing
Different time intervals can show different
behaviours
E.g.:
objects that are close to each other within a time
interval can be much distant in other periods of time
The time interval becomes a parameter
E.g.:
rush hours vs. low traffic times
Problem: significant time intervals are not always
known a priori
An
automated mechanism is needed to find them
13
Temporal focusing
The proposed method
1.
Provide a notion of interestingness to be
associated with time intervals
2.
We define it in terms of estimated quality of the clustering
extracted on the given time interval
Formalize the Temporal focusing task as an
optimization problem
Discover the time interval
interestingness measure
that
maximizes
the
14
A quality measure for
density-based clustering
General principle
High-density clusters separated by
low-density noise are preferred
The method
High-density clusters correspond to
low dents in the reachability plot
=> Evaluate the global quality Q of the
clustering output as the average
reachability within clusters (noise
is discarded)
LOW
DENSITY
HIGH
DENSITY
MEDIUM
DENSITY
Definition: given ε and dataset D, compute QD, ε as:
QD, ε = - R (D, ε’) = - AVGo in D’ reach-d(o)
D’ = D – {noise objects}
15
FAQs
How Q() is computed for a given time interval I ?
Step 1: trajectory segments out of I are clipped away
Step 2: OPTICS is run on the clipped trajectories
Step 3: Q(I) is computed on the output reachability plot
How is the reachability threshold set for each interval?
A reachability threshold is needed in order to locate clusters (and noise)
The threshold for the largest I is manually set by the user
Thresholds for other intervals I’ I are computed from the first one by
proportionally rescaling w.r.t. average reachability
Is the optimal Q(I) biased towards tiny intervals?
Yes. The problem has been fixed by defining Q’(I) = Q(I) / log |I|
=> A small decrease in Q(I) is accepted when it yields a much larger I
16
Esperiments
A more complex sample dataset (generated by CENTRE)
Clear clusters in the central time interval vs. dispersion on the borders
17
Optimizing Q()
Find the optimal Q() by plotting values for all time intervals
The optimum corresponds to the central time interval
18
Heuristics for optimum search
Each Q() value computation requires a run of the OPTICS algorithm
Computing all O(N2) values is too expensive
Alternative approaches are needed
Preliminary tests with hill-climbing (i.e., greedy) approach:
(N=|{sub-intervals}|)
100
80
starting
points
60
Test on the same dataset
Global optimum found in the
70,7% of runs
Avg. number of steps: 17
Avg. OPTICS runs: 49
Tend
40
local
optima
0
20
40
global
optimum
60
20
80
0
100
Tstart
19
Conclusions & Future works
Summary of the work
Extension
of OPTICS to a trajectory data model & distance
Definition of the Temporal Focusing problem
Definition of a clustering quality measure
(Preliminary) Tests with exhaustive & greedy optimization
Future work
Experimental
validation over broader benchmarks
Tighter integration between OPTICS and search strategy
Alternative, domain-specific definition of quality measures
20