Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Human genetic clustering wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

K-means clustering wikipedia , lookup

Nearest-neighbor chain algorithm wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
Time-focused density-based
clustering of trajectories
of moving objects
Margherita D’Auria
Mirco Nanni
Dino Pedreschi
Plan of the talk

Introduction




Density-based clustering on trajectories



Trajectory data model distance measure
Results
Temporal Focusing



Motivations
Problem & context
Density-based Clustering (OPTICS)
A clustering quality measure
Heuristics for optimal temporal interval
Conclusions & future work
2
Motivations

Plenty of actual and future data sources for
spatio-temporal data

Sophisticated analysis method are required, in
order to fully exploit them
 Data
mining methods
 Which kind of patterns/models?

Main objectives
 A better understanding of
the application domain
 An improvement for private and public services
3
Problem & context

A distinguishing case: Mobile devices



PDAs
Mobile phones
LBS-enabled devices (may include the two above)

They (can) yield traces of their movement

An important problem:



Discovering groups of individuals that (approx.) move together in some
period of time
E.g.: detection of traffic jams during rush hours
A candidate Data Mining reformulation of the problem

Clustering of individuals’ trajectories
4
Which kind of clustering?

Several alternatives are available

General requirements:
 Non-spherical


clusters should be allowed
E.g.: A traffic jam along a road
It should be represented as a cluster which individuals form a
“snake-shaped” cluster
 Tolerance
to noise
 Low computational cost
 Applicability to complex, possibly non-vectorial data

A suitable candidate: Density-based clustering
 In
particular, we adopt OPTICS
5
A crushed intro to OPTICS

A density threshold is defined through two parameters:
 ε: A neighborhood radius
 MinPts: Minimum number of points

Key concepts:

Core objects


Reachability-distance reach-d( p, q )


Objects with a ε-Neighborhood that contains at least MinPts objects
(simplified definition:) Distance between objects p and q
Example:



Object “q” is a core object if MinPts=2
Object “p” is not
Their reach-d() is shown
ε
q
reach-d(p,q)
p
ε –neighborhood of q
6
A crushed intro to OPTICS
The algorithm:
1.
Repeatedly choose a non-visited random object, until a core object
is selected
2.
Select the core object having the smallest reachability distance
from all the visited core objects. If none can be found, go to step 1
Order of visit
Output: reach-d() of all visited points
(reachability plot)
“jump” from left-hand group (0-9)
to right-hand one (10-18)
Reachability
threshold
Cluster 1
Cluster 2
7
Applying OPTICS to trajectories

Two key issues have to be solved
 A suitable representation for

Which data model for trajectories?
 A mean


trajectories is needed
for comparing trajectories has to be provided
Which distance between objects?
OPTICS needs to define one to perform range queries
8
A trajectory data model

Raw input data:



Each trajectory is represented as a set of time-stamped coordinates
T=(t1,x1,y1), …, (tn, xn, yn) => Object position at time ti was (xi,yi)
Data model

Parametric-spaghetti: linear interpolation between consecutive points
9
A distance between trajectories

Adopted distance = average distance
D( 1 , 2 ) |T

d ( (t ),


T
1
2
(t )) dt
|T |
It is a metric => efficient indexing methos allowed
10
A sample dataset

Set of trajectories forming 4 clusters + noise

Generated by the CENTRE system (KDDLab software)
11
OPTICS vs.
HAC & K-means
K-means
HAC-average
OPTICS
12
Temporal focusing

Different time intervals can show different
behaviours
 E.g.:
objects that are close to each other within a time
interval can be much distant in other periods of time

The time interval becomes a parameter
 E.g.:

rush hours vs. low traffic times
Problem: significant time intervals are not always
known a priori
 An
automated mechanism is needed to find them
13
Temporal focusing

The proposed method
1.
Provide a notion of interestingness to be
associated with time intervals

2.
We define it in terms of estimated quality of the clustering
extracted on the given time interval
Formalize the Temporal focusing task as an
optimization problem

Discover the time interval
interestingness measure
that
maximizes
the
14
A quality measure for
density-based clustering

General principle


High-density clusters separated by
low-density noise are preferred
The method

High-density clusters correspond to
low dents in the reachability plot
=> Evaluate the global quality Q of the
clustering output as the average
reachability within clusters (noise
is discarded)

HIGH
DENSITY
MEDIUM
DENSITY
LOW
DENSITY
Definition: given ε and dataset D, compute QD, ε as:
QD, ε = - R (D, ε’) = - AVGo in D’ reach-d(o)
D’ = D – {noise objects}
15
FAQs

How Q() is computed for a given time interval I ?




How is the reachability threshold set for each interval?




Step 1: trajectory segments out of I are clipped away
Step 2: OPTICS is run on the clipped trajectories
Step 3: Q(I) is computed on the output reachability plot
A reachability threshold is needed in order to locate clusters (and noise)
The threshold for the largest I is manually set by the user
Thresholds for other intervals I’ I are computed from the first one by
proportionally rescaling w.r.t. average reachability
Is the optimal Q(I) biased towards tiny intervals?
Yes. The problem has been fixed by defining Q’(I) = Q(I) / log |I|
=> A small decrease in Q(I) is accepted when it yields a much larger I

16
Esperiments

A more complex sample dataset (generated by CENTRE)
 Clear clusters in the central time interval vs. dispersion on the borders
17
Optimizing Q()

Find the optimal Q() by plotting values for all time intervals
 The optimum corresponds to the central time interval
18
Heuristics for optimum search

Each Q() value computation requires a run of the OPTICS algorithm

Computing all O(N2) values is too expensive

Alternative approaches are needed

Preliminary tests with hill-climbing (i.e., greedy) approach:
starting
points
local
optima
global
optimum
(N=|{sub-intervals}|)

Test on the same dataset

Global optimum found in the
70,7% of runs

Avg. number of steps: 17

Avg. OPTICS runs: 49
19
Conclusions & Future works

Summary of the work
 Extension
of OPTICS to a trajectory data model & distance
 Definition of the Temporal Focusing problem
 Definition of a clustering quality measure
 (Preliminary) Tests with exhaustive & greedy optimization

Future work
 Experimental validation over
broader benchmarks
 Tighter integration between OPTICS and search strategy
 Alternative, domain-specific definition of quality measures
20