Efficient Density-Based Clustering of Complex Objects
Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle
University of Munich
Institute for Computer Science
Brighton, UK
November 01-04, 2004
Outline
• Density-Based Clustering
• Clustering of Complex Objects
• Experimental Evaluation
Martin Pfeifle, University of Munich
ICDM 2004, Brighton, UK
Outline
• Density-Based Clustering
Core Object · Density-Reachability ·
DBSCAN · OPTICS
• Clustering of Complex Objects
• Experimental Evaluation
Data Mining
• Larger and larger amounts of data collected automatically
Hubble Space Telescope
Telecommunication Data
Market-Basket Data
• Too large for humans to analyze manually
• Tools to assist analysis are necessary → KDD / Data Mining
Clustering
• Clustering
– Efficiently grouping the database into sub-groups (clusters) such that
• similarity within clusters is maximized
• similarity between clusters is minimized
• Flat Clustering
– one level of clusters
– e.g. density-based clustering algorithm DBSCAN [KDD 96]
• Hierarchical Clustering
– nested clusters
– e.g. density-based clustering algorithm OPTICS [SIGMOD 99]
Density-Based Clustering I
• Parameters
– range ε and minimal weight MinPts
• Definition: core object
– q is a core object if |rangeQuery(q, ε)| ≥ MinPts
• Definition: directly density-reachable
– p is directly density-reachable from q if q is a core object and p ∈ rangeQuery(q, ε)
• Definition: density-reachable
– density-reachable: transitive closure of “directly density-reachable”
(Figure: three examples with MinPts = 5 illustrating a core object q, an object p directly density-reachable from q, and a chain of objects q, o, r.)
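The three definitions on this slide can be written down almost literally. The following is a minimal sketch, assuming 2-dimensional points stored as tuples and Euclidean distance; the function names and the tiny example database are illustrative and do not come from the slides.

```python
import math

def range_query(db, q, eps):
    """All objects within distance eps of q (q itself is included)."""
    return [o for o in db if math.dist(o, q) <= eps]

def is_core_object(db, q, eps, min_pts):
    """q is a core object if its eps-range holds at least min_pts objects."""
    return len(range_query(db, q, eps)) >= min_pts

def directly_density_reachable(db, p, q, eps, min_pts):
    """p is directly density-reachable from q."""
    return is_core_object(db, q, eps, min_pts) and p in range_query(db, q, eps)

def density_reachable(db, p, q, eps, min_pts):
    """Transitive closure: follow direct reachability steps starting at q."""
    reached, frontier = set(), [q]
    while frontier:
        cur = frontier.pop()
        if not is_core_object(db, cur, eps, min_pts):
            continue  # only core objects can reach other objects
        for o in range_query(db, cur, eps):
            if o not in reached:
                reached.add(o)
                frontier.append(o)
    return p in reached

# Hypothetical example database (not from the slides).
db = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (10, 10)]
EPS, MIN_PTS = 1.5, 4
```

In this example (2, 1) is not directly density-reachable from (0, 0), but it is density-reachable via the core object (1, 1). The relation is not symmetric: (2, 1) is not a core object, so nothing is density-reachable from it.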
Density-Based Clustering II
• Core Idea of Hierarchical Cluster Ordering:
Order the objects linearly such that objects of a cluster are adjacent in the ordering.
• Definition: core-distance
$$\text{core-dist}_{\varepsilon,MinPts}(o) = \begin{cases} \infty & \text{if } |rangeQuery(o,\varepsilon)| < MinPts \\ MinPts\text{-dist}(o) & \text{otherwise} \end{cases}$$
• Definition: reachability-distance
$$\text{reach-dist}_{\varepsilon,MinPts}(p,o) = \max\big(\text{core-dist}_{\varepsilon,MinPts}(o),\ \text{dist}(p,o)\big)$$
(Figure: object o with its ε-range for MinPts = 5, showing core-distance(o) and reachability-distance(p,o) for two different objects p.)
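As a small worked illustration of the two definitions, the sketch below computes both distances on a hypothetical 4-point database. It assumes Euclidean distance and that an object counts as a member of its own ε-range (at distance 0); both are assumptions of this sketch, as are all names and data.

```python
import math

def core_distance(db, o, eps, min_pts):
    """MinPts-distance of o if o is a core object, otherwise infinity."""
    in_range = sorted(math.dist(o, x) for x in db if math.dist(o, x) <= eps)
    if len(in_range) < min_pts:
        return math.inf
    return in_range[min_pts - 1]  # distance to the MinPts-th nearest object

def reachability_distance(db, p, o, eps, min_pts):
    """Reachability-distance of p with respect to o."""
    return max(core_distance(db, o, eps, min_pts), math.dist(p, o))

# Hypothetical points on a line (not from the slides).
db = [(0, 0), (1, 0), (2, 0), (5, 0)]
EPS, MIN_PTS = 3, 3
```

Here core_distance of (0, 0) is 2.0, so (1, 0) has reachability-distance 2.0 with respect to (0, 0) although it is only 1.0 away. The object (5, 0) has just two objects in its ε-range, so its core-distance is infinite.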
OPTICS Algorithm
• Example Database (2-dimensional, 16 points)
• ε = 44, MinPts = 3
(Figure: the 16 points A to N, P and R in the plane, with the ε-range and core-distance of A marked, next to the reachability plot being built up; the reachability axis is capped at ε = 44.)
Trace of the algorithm (cluster ordering so far, followed by the seedlist):
– ordering: (empty); seedlist: (empty)
– ordering: A; seedlist: (B, 40) (I, 40)
– ordering: A B; seedlist: (I, 40) (C, 40)
– ordering: A B I; seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)
– ordering: A B I J; seedlist: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)
– ordering: A B I J L; seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)
– …
– ordering: A B I J L M K N R P C D F G E H; seedlist: (empty)
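The coordinates of the 16 example points are not given in the transcript, so the trace cannot be reproduced exactly. The following is a minimal sketch of the basic OPTICS loop on a hypothetical database of two well-separated groups, assuming Euclidean distance, a linear-scan range query, and a binary heap with lazy deletion as the seedlist; none of these choices are taken from the slides.

```python
import heapq
import math

def optics(db, eps, min_pts):
    """Return the cluster ordering as a list of (object, reachability)."""
    reach = {o: math.inf for o in db}
    processed, ordering = set(), []
    for start in db:
        if start in processed:
            continue
        seeds = [(math.inf, start)]  # seedlist ordered by reachability
        while seeds:
            _, o = heapq.heappop(seeds)
            if o in processed:
                continue  # stale heap entry, a smaller one was handled before
            processed.add(o)
            ordering.append((o, reach[o]))
            nbrs = [x for x in db if math.dist(o, x) <= eps]
            if len(nbrs) < min_pts:
                continue  # o is not a core object, it updates nothing
            core = sorted(math.dist(o, x) for x in nbrs)[min_pts - 1]
            for p in nbrs:
                new_reach = max(core, math.dist(o, p))
                if p not in processed and new_reach < reach[p]:
                    reach[p] = new_reach
                    heapq.heappush(seeds, (new_reach, p))
    return ordering

# Hypothetical database: two groups, deliberately interleaved in input order.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
db = [pt for pair in zip(group_a, group_b) for pt in pair]
ordering = optics(db, eps=2, min_pts=3)
```

Although the input alternates between the two groups, the ordering lists all objects of one group before the other, and the reachability value jumps to infinity at the start of each group. That jump is what shows up as a peak between two valleys in a reachability plot.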
Outline
• Foundations of Density-Based Clustering
Core Object · Density-Reachability ·
DBSCAN · OPTICS
• Clustering of Complex Objects
Direct Integration of the Multi-Step Query Processing Paradigm
• Experimental Evaluation
Complex Objects
• complex objects
• complex models
• complex distance measure
Single-Step Clustering Approach
(Diagram: density-based clustering algorithms, like DBSCAN and OPTICS, send a Query Q(q,ε) (1) to the exact information and receive the Result R(q,ε) (2).)
• Performance Problems
• For each database object q, we perform one range query.
• Expensive exact distance computation d_o(o,q) for each object o of the database, independent of the ε range.
Multi-Step Query Processing
• Multi-Step Similarity Search
– Filter Step (index-based) → candidates → Refinement Step (exact evaluation) → results
– Range Queries (Faloutsos et al. 94)
– k-Nearest Neighbor Queries (Korn et al. 96)
– Optimal k-Nearest Neighbor Queries (Seidl, Kriegel 98)
• No False Drops?
– Lower-Bounding Property: d_f(p, q) ≤ d_o(p, q), where d_f is the filter distance and d_o is the object distance
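The filter and refinement steps for a range query can be sketched as follows. This is an illustration under assumptions: objects are 3-dimensional tuples, the exact distance d_o is Euclidean, and the filter distance d_f is the difference in the first coordinate, which lower-bounds the Euclidean distance. The data and names are hypothetical, and the index that would normally support the filter step is replaced by a linear scan.

```python
import math

def multi_step_range_query(db, q, eps, filter_dist, exact_dist):
    """Filter step on d_f, then refinement with d_o on the candidates only."""
    candidates = [o for o in db if filter_dist(o, q) <= eps]      # filter step
    results = [o for o in candidates if exact_dist(o, q) <= eps]  # refinement
    return results, len(candidates)

def exact_dist(p, q):
    """Object distance d_o (stand-in for an expensive distance function)."""
    return math.dist(p, q)

def filter_dist(p, q):
    """Filter distance d_f: never larger than d_o (lower-bounding property)."""
    return abs(p[0] - q[0])

# Hypothetical database and query.
db = [(0, 0, 0), (1, 5, 0), (2, 0, 0), (9, 0, 0), (1, 1, 1)]
query, EPS = (0, 0, 0), 2.5
results, n_refined = multi_step_range_query(db, query, EPS, filter_dist, exact_dist)
full_scan = [o for o in db if exact_dist(o, query) <= EPS]
```

Because d_f never exceeds d_o, any object discarded by the filter is also outside the exact ε-range, so the result equals that of a full scan (no false drops). Here only 4 of the 5 objects need an exact distance computation; with a more selective filter the saving grows.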
Traditional Multi-Step Clustering Approach
(Diagram: density-based clustering algorithms, like DBSCAN and OPTICS, send Query (q,ε) (1) to a range query processor (e.g. Faloutsos et al. 94). The processor issues Query Q(q,ε) using d_f on the filter information (2), obtains the Candidates C(q,ε) (3), performs the refinement step, i.e. the computation of d_o(o,q) for all o ∈ C(q,ε), on the exact information (4), and returns Result (q,ε) (5).)
• Performance Problems
• For each database object q, we perform one range query (1).
• The range query is first performed on the filter information (2,3).
• One expensive exact distance computation d_o(o,q) for each object o of the candidate set C(q,ε) is performed (4). This refinement step is very expensive for non-selective filters or high ε values.
Integrated Multi-Step Clustering Approach
• Proposed Solution
• Direct integration of the multi-step query processing paradigm into the clustering algorithm
• postponing expensive exact distance computations as long as possible
(Diagram: extended density-based clustering algorithms, like DBSCAN and OPTICS, issue Query Q(q,ε) using d_f on the filter information (1) and obtain the Candidates C(q,ε) (2). On the exact information they perform the computation of d_o(o,q) for the core-properties of q (3) and the postponed computations of d_o(o,q) for the reachability-properties of o (4).)
• For each database object q, we perform one range query on the filter information (1,2).
• Only those exact distances d_o(o,q) are computed which are necessary to determine the core-properties of q (3).
• A beneficial heuristic for determining the reachability-properties is applied which saves on exact distance computations (4).
Integrated Multi-Step Clustering Approach
Determination of Core-Properties
MinPts = 3, ε = 75
Filter Information: Sorted Distance List for query object Q
– K: d_f(K,Q) = 10, refined to d_o(K,Q) = 53
– Z: d_f(Z,Q) = 12, refined to d_o(Z,Q) = 69
– R: d_f(R,Q) = 18, refined to d_o(R,Q) = 53
– M: d_f(M,Q) = 55
– A: d_f(A,Q) = 58
– I: d_f(I,Q) = 65
core-distance of Q = 53
• First, we carry out a range query on the filter for each query object Q.
• Second, we order the resulting candidate set in ascending order according to the filter distance.
• Third, we walk through the candidate set and perform exact distance calculations until we can be sure that we have found the MinPts nearest neighbors.
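The three steps can be sketched as below, using the filter and exact distances shown on this slide. The stopping rule is spelled out as an assumption of this sketch: refinement stops as soon as the MinPts-th smallest exact distance found so far does not exceed the next filter distance, since by the lower-bounding property no remaining candidate can be closer. The sketch also assumes that Q counts as its own neighbor at distance 0, which is consistent with the core-distance of 53 on the slide. The function name is illustrative.

```python
import bisect
import math

def core_distance_multistep(candidates, exact_dist, min_pts, eps):
    """candidates: (object, filter distance) pairs from the filter range query.
    Returns (core-distance, dict of the exact distances actually computed)."""
    exact_sorted = [0.0]  # the query object itself (assumption of this sketch)
    refined = {}
    for obj, d_f in sorted(candidates, key=lambda c: c[1]):
        # Lower bounding: every remaining candidate has d_o >= d_f.
        if len(exact_sorted) >= min_pts and exact_sorted[min_pts - 1] <= d_f:
            break
        d_o = exact_dist(obj)  # expensive exact distance computation
        refined[obj] = d_o
        bisect.insort(exact_sorted, d_o)
    if len(exact_sorted) >= min_pts and exact_sorted[min_pts - 1] <= eps:
        return exact_sorted[min_pts - 1], refined
    return math.inf, refined

# Values from the slide: filter distances of the candidates of Q ...
candidates = [("K", 10), ("Z", 12), ("R", 18), ("M", 55), ("A", 58), ("I", 65)]
# ... and the exact distances that the slide shows as computed.
exact = {"K": 53, "Z": 69, "R": 53}
core, refined = core_distance_multistep(candidates, exact.__getitem__, 3, 75)
```

The lookup table holds no exact distances for M, A and I, so the call would fail if the sketch tried to refine them. It stops after K, Z and R because the third smallest exact distance (53) is already below d_f(M,Q) = 55.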
Integrated Multi-Step Clustering Approach
Extended Seedlist
• Data Structure “List of Lists”
• Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
• Each list of predecessor objects is ordered ascendingly, and the lists themselves are ordered ascendingly by their first elements.
Extended seedlist before the insertion:
– R: d_f(R,B)=18, d_f(R,D)=34
– K: d_f(K,B)=20, d_f(K,L)=30, d_f(K,G)=43, d_f(K,C)=55
– M: d_o(M,C)=65
Result list of the current query object Q which has to be inserted into the extended seedlist:
d_o(K,Q)=53, d_o(Z,Q)=69, d_o(R,Q)=53, d_f(M,Q)=55, d_f(A,Q)=58, d_f(I,Q)=65
Extended seedlist after the insertion, as shown in the figure (d_o(K,Q)=53 takes the position of d_f(K,C)=55):
– R: d_f(R,B)=18, d_f(R,D)=34, d_o(R,Q)=53
– K: d_f(K,B)=20, d_f(K,L)=30, d_f(K,G)=43, d_o(K,Q)=53
– M: d_f(M,Q)=55, d_o(M,C)=65
– A: d_f(A,Q)=58
– I: d_f(I,Q)=65
– Z: d_o(Z,Q)=69
Integrated Multi-Step Clustering Approach
Determination of Next Query Object
• Data Structure “List of Lists”
• Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
Trace on the extended seedlist:
– The first element of the first list is the filter distance d_f(R,B)=18. It is refined to d_o(R,B)=44, and R's list is reordered to d_f(R,D)=34, d_o(R,B)=44, d_o(R,Q)=53.
– The first element of the first list is now the filter distance d_f(K,B)=20. It is refined to d_o(K,B)=25, and K's list becomes d_o(K,B)=25, d_f(K,L)=30, d_f(K,G)=43, d_o(K,Q)=53.
– d_o(K,B)=25 is an exact distance and is still the smallest first element, so K can be taken as the next query object.
– Remaining lists: R: d_f(R,D)=34, d_o(R,B)=44, d_o(R,Q)=53; M: d_f(M,Q)=55, d_o(M,C)=65; A: d_f(A,Q)=58; I: d_f(I,Q)=65; Z: d_o(Z,Q)=69.
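The selection of the next query object can be sketched with a dictionary of lists. This is a simplified illustration, not the authors' data structure: each entry is a tuple (distance, is_exact, predecessor), lists are re-sorted on every pass instead of being maintained incrementally, and the only exact distances used are the two that appear on the slides, d_o(R,B)=44 and d_o(K,B)=25.

```python
def next_query_object(seedlist, exact_dist):
    """Refine filter distances at the head of the list of lists until the
    smallest head is an exact distance; return that object.
    Returns (object, predecessor, distance, number of exact computations)."""
    computed = 0
    while seedlist:
        for entries in seedlist.values():
            entries.sort()  # keep each predecessor list ascending
        obj = min(seedlist, key=lambda o: seedlist[o][0])
        dist, is_exact, pred = seedlist[obj][0]
        if is_exact:
            del seedlist[obj]
            return obj, pred, dist, computed
        # Head is only a lower bound: replace it by the exact distance.
        seedlist[obj][0] = (exact_dist(obj, pred), True, pred)
        computed += 1
    return None

# Extended seedlist from the slides: (distance, is_exact, predecessor).
seedlist = {
    "R": [(18, False, "B"), (34, False, "D"), (53, True, "Q")],
    "K": [(20, False, "B"), (30, False, "L"), (43, False, "G"), (53, True, "Q")],
    "M": [(55, False, "Q"), (65, True, "C")],
    "A": [(58, False, "Q")],
    "I": [(65, False, "Q")],
    "Z": [(69, True, "Q")],
}
exact = {("R", "B"): 44, ("K", "B"): 25}
result = next_query_object(seedlist, lambda o, p: exact[(o, p)])
```

Only two exact distances are computed before K is selected. All other entries stay as filter distances, which is the sense in which exact computations are postponed as long as possible.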
Experimental Evaluation
Test Data Sets
• High-dimensional feature vectors representing CAD objects [DASFAA 03]
– not very selective filter used (Euclidean norm)
• Graphs representing images [DAWAK 03]
– Expensive exact distance function
– Selective filter used
Experimental Evaluation
DBSCAN
(Figure: two plots, “Feature vectors” and “Graphs”, showing runtime [sec.] on a logarithmic scale over the no. of objects (500 to 3000) for full table scan, traditional multi-step query processing, and integrated multi-step query processing.)
• Already non-selective filters (feature vectors) are helpful for accelerating DBSCAN by up to an order of magnitude when using the new integrated multi-step query processing approach.
• The traditional multi-step query processing approach does not benefit from non-selective filters (feature vectors), as the cardinality of the candidate set is still high even when small ε values are used.
• When filters of high selectivity (graphs) are used, our new integrated multi-step query processing approach leads to a speed-up of two orders of magnitude compared to a full table scan.
Experimental Evaluation
OPTICS
(Figure: two plots, “Feature vectors” and “Graphs”, showing runtime [sec.] on a logarithmic scale over the no. of objects (500 to 3000) for full table scan, traditional multi-step query processing, and integrated multi-step query processing.)
• When using filters of high selectivity (graphs), our new integrated multi-step query processing approach outperforms the traditional multi-step query processing approach and the full table scan by a factor of up to 30.
• For high ε values, as used with OPTICS, the full table scan performs even better than the traditional multi-step query processing approach.
Conclusions
Summary: “Efficient Density-Based Clustering of Complex Objects”
• direct integration of the multi-step query processing
paradigm into the clustering algorithm
• MinPts-nearest neighbor queries on the exact information
• postponing expensive exact distance computations as
long as possible
Future Work
• integration of the multi-step query processing paradigm into
other data mining algorithms