Efficient Density-Based Clustering of Complex Objects
Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle
University of Munich, Institute for Computer Science
Brighton, UK, November 01-04, 2004

Outline
• Density-Based Clustering: Core Object · Density-Reachability · DBSCAN · OPTICS
• Clustering of Complex Objects
• Experimental Evaluation

Martin Pfeifle, University of Munich. ICDM 2004, Brighton, UK

Data Mining
• Larger and larger amounts of data are collected automatically (Hubble Space Telescope, telecommunication data, market-basket data)
• Too large for humans to analyze manually
• Tools to assist analysis are necessary: KDD / Data Mining

Clustering
• Clustering: efficiently grouping the database into sub-groups (clusters) such that
    • similarity within clusters is maximized
    • similarity between clusters is minimized
• Flat clustering: one level of clusters, e.g. density-based clustering algorithm DBSCAN [KDD 96]
• Hierarchical clustering: nested clusters, e.g. density-based clustering algorithm OPTICS [SIGMOD 99]

Density-Based Clustering I
• Parameters: range ε and minimal weight MinPts (MinPts = 5 in the illustrations)
• Definition: core object. q is a core object if $|\mathrm{rangeQuery}(q,\varepsilon)| \ge MinPts$
• Definition: directly density-reachable. p is directly density-reachable from q if q is a core object and $p \in \mathrm{rangeQuery}(q,\varepsilon)$
• Definition: density-reachable. Transitive closure of "directly density-reachable"
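The definitions of core object, directly density-reachable and density-reachable can be illustrated with a small sketch. It assumes 2-dimensional points and the Euclidean distance, neither of which the slides prescribe; the function names and the toy database are hypothetical, and the linear-scan range query stands in for whatever index a real system would use.

```python
import math

def range_query(db, q, eps):
    """All objects within distance eps of q (q itself included)."""
    return [o for o in db if math.dist(o, q) <= eps]

def is_core(db, q, eps, min_pts):
    """q is a core object if its eps-neighborhood holds at least min_pts objects."""
    return len(range_query(db, q, eps)) >= min_pts

def directly_density_reachable(db, p, q, eps, min_pts):
    """p is directly density-reachable from q: q is core and p lies in q's range."""
    return is_core(db, q, eps, min_pts) and math.dist(p, q) <= eps

def density_reachable(db, p, q, eps, min_pts):
    """Transitive closure: follow chains of core objects starting at q."""
    seen, stack = {q}, [q]
    while stack:
        cur = stack.pop()
        if not is_core(db, cur, eps, min_pts):
            continue  # only core objects can extend a chain
        for o in range_query(db, cur, eps):
            if o == p:
                return True
            if o not in seen:
                seen.add(o)
                stack.append(o)
    return False

# Hypothetical toy database: four points on a line plus one outlier.
db = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
eps, min_pts = 1.5, 3
```

In this toy database (3, 0) is not directly density-reachable from (1, 0), but it is density-reachable via the core object (2, 0). The relation is not symmetric: nothing is density-reachable from the non-core object (0, 0).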
Density-Based Clustering II
• Core idea of hierarchical cluster ordering: order the objects linearly such that objects of a cluster are adjacent in the ordering.
• Definition: core-distance (MinPts = 5 in the illustration)
  $$\text{core-dist}_{\varepsilon,MinPts}(o) = \begin{cases} \infty \ (\text{undefined}) & \text{if } |\mathrm{rangeQuery}(o,\varepsilon)| < MinPts \\ MinPts\text{-dist}(o) & \text{otherwise} \end{cases}$$
• Definition: reachability-distance
  $$\text{reach-dist}_{\varepsilon,MinPts}(p,o) = \max\bigl(\text{core-dist}_{\varepsilon,MinPts}(o),\ \mathrm{dist}(p,o)\bigr)$$

OPTICS Algorithm
• Example database (2-dimensional, 16 points: A, B, C, D, E, F, G, H, I, J, K, L, M, N, P, R)
• ε = 44, MinPts = 3
[Figure: point plot with the ε-range and core-distance of A, and the reachability plot ("reach", y-axis marked at 44), built up step by step]
• Processed objects and seedlist after each step:
    • (start) seedlist: empty
    • A seedlist: (B, 40) (I, 40)
    • A B seedlist: (I, 40) (C, 40)
    • A B I seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)
    • A B I J seedlist: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)
    • A B I J L seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)
    • …
    • A B I J L M K N R P C D F G E H seedlist: empty

Outline
• Foundations of Density-Based Clustering: Core Object · Density-Reachability · DBSCAN · OPTICS
• Clustering of Complex Objects: Direct Integration of the Multi-Step Query Processing Paradigm
• Experimental Evaluation

Complex Objects
• complex objects, complex models, complex distance measure

Single-Step Clustering Approach
[Diagram: density-based clustering algorithms, like DBSCAN and OPTICS, send (1) a query Q(q,ε) to the exact information and receive (2) the result R(q,ε)]
• Performance problems
    • For each database object q, we perform one range query.
    • An expensive exact distance computation $d_o(o,q)$ is needed for each object o of the database, independent of the ε range.

Multi-Step Query Processing
• Multi-step similarity search: filter step (index-based) → candidates → refinement step (exact evaluation) → results
    • Range queries (Faloutsos et al. 94)
    • k-nearest neighbor queries (Korn et al. 96)
    • Optimal k-nearest neighbor queries (Seidl, Kriegel 98)
• No false drops?
  Lower-bounding property: $d_f(p,q) \le d_o(p,q)$, i.e. the filter distance never exceeds the object distance.

Traditional Multi-Step Clustering Approach
[Diagram: density-based clustering algorithms, like DBSCAN and OPTICS, send (1) a query (q,ε) to a range query processor (e.g. Faloutsos et al. 94), which issues (2) the query Q(q,ε) using $d_f$ on the filter information, obtains (3) the candidates C(q,ε), performs (4) the refinement step, i.e. the computation of $d_o(o,q)$ for all $o \in C(q,\varepsilon)$ on the exact information, and returns (5) the result (q,ε)]
• Performance problems
    • For each database object q, we perform one range query (1).
    • The range query is first performed on the filter information (2,3).
    • One expensive exact distance computation $d_o(o,q)$ is performed for each object o of the candidate set C(q,ε) (4). This refinement step is very expensive for non-selective filters or high ε values.

Integrated Multi-Step Clustering Approach
[Diagram: extended density-based clustering algorithms, like DBSCAN and OPTICS, issue (1) the query Q(q,ε) using $d_f$ on the filter information, obtain (2) the candidates C(q,ε), perform (3) computations of $d_o(o,q)$ for the core-properties of q, and (4) postponed computations of $d_o(o,q)$ for the reachability-properties of o, both on the exact information]
• Proposed solution
    • Direct integration of the multi-step query processing paradigm into the clustering algorithm
    • Postponing expensive exact distance computations as long as possible
    • For each database object q, we perform one range query on the filter information (1,2).
    • Only those exact distances $d_o(o,q)$ are computed which are necessary to determine the core-properties of q (3).
    • A beneficial heuristic for determining the reachability-properties is applied, which saves on exact distance computations (4).
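As a point of comparison for the integrated approach, the traditional filter-refinement range query can be sketched as follows. This is a minimal illustration under stated assumptions, not the range query processor used in the talk: the exact distance is taken to be Euclidean, the filter distance is the absolute difference of the first coordinate (which lower-bounds the Euclidean distance), and all names and data are hypothetical.

```python
import math

def d_o(p, q):
    """Exact (expensive) object distance; here simply Euclidean."""
    return math.dist(p, q)

def d_f(p, q):
    """Cheap filter distance on the first coordinate only.
    It satisfies the lower-bounding property d_f(p, q) <= d_o(p, q)."""
    return abs(p[0] - q[0])

def multi_step_range_query(db, q, eps):
    """Filter step, then refinement of every candidate.
    Returns the result set and the number of exact distance computations."""
    candidates = [o for o in db if d_f(o, q) <= eps]    # filter step
    result = [o for o in candidates if d_o(o, q) <= eps]  # refinement step
    return result, len(candidates)

# Hypothetical data: (1, 5) passes the filter but fails the refinement.
db = [(0, 0), (1, 0), (1, 5), (4, 0), (9, 9)]
result, n_exact = multi_step_range_query(db, (0, 0), eps=2)
brute_force = [o for o in db if d_o(o, (0, 0)) <= 2]
```

Because the filter distance never exceeds the exact distance, every true answer survives the filter step (no false drops), so the result equals the brute-force scan. The cost is one exact computation per candidate, which is what makes this scheme expensive for non-selective filters or large ε.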
Integrated Multi-Step Clustering Approach: Determination of Core-Properties
• Example: MinPts = 3, ε = 75, query object Q
• Candidate list sorted by filter distance; exact distances are shown where they were computed:
    • K: $d_f(K,Q)=10$, refined to $d_o(K,Q)=53$
    • Z: $d_f(Z,Q)=12$, refined to $d_o(Z,Q)=69$
    • R: $d_f(R,Q)=18$, refined to $d_o(R,Q)=53$
    • M: $d_f(M,Q)=55$
    • A: $d_f(A,Q)=58$
    • I: $d_f(I,Q)=65$
• Resulting core-distance of Q = 53
• First, we carry out a range query on the filter for each query object Q.
• Second, we order the resulting candidate set in ascending order according to the filter distance.
• Third, we walk through the candidate set and perform exact distance calculations until we can be sure that we have found the MinPts nearest neighbors.

Integrated Multi-Step Clustering Approach: Extended Seedlist
• Data structure "list of lists"
• Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
• Current extended seedlist (the first elements are ascendingly ordered, and each list of predecessor objects is ascendingly ordered):
    • R: $d_f(R,B)=18$, $d_f(R,D)=34$
    • K: $d_f(K,B)=20$, $d_f(K,L)=30$, $d_f(K,G)=43$, $d_f(K,C)=55$
    • M: $d_o(M,C)=65$
• Result list of the current query object Q, which has to be inserted into the extended seedlist: $d_o(K,Q)=53$, $d_o(Z,Q)=69$, $d_o(R,Q)=53$, $d_f(M,Q)=55$, $d_f(A,Q)=58$, $d_f(I,Q)=65$
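Returning to the determination of core-properties, the three steps (filter range query, ascending sort by filter distance, exact refinement until the MinPts nearest neighbors are certain) can be sketched as follows. The stopping rule is our reading of "until we can be sure": refinement stops once the MinPts-th smallest exact distance is no larger than the next filter distance, which is safe under the lower-bounding property. The sketch also assumes that the query object counts as its own neighbor at distance 0; this assumption is not stated on the slide, but it reproduces the slide's core-distance of 53. Function and variable names are hypothetical.

```python
def core_distance(candidates, exact_dist, eps, min_pts):
    """candidates: (object, filter distance) pairs from the filter range query.
    exact_dist: callable returning the exact distance of an object to the query.
    Returns (core distance or None, dict of the exact distances computed)."""
    ordered = sorted(candidates, key=lambda c: c[1])   # step 2: sort by d_f
    exact = [0.0]   # assumption: the query object is its own neighbor
    refined = {}
    for obj, d_f in ordered:                           # step 3: lazy refinement
        exact.sort()
        # Every remaining object has exact distance >= d_f (lower bounding),
        # so the MinPts nearest neighbors are already known.
        if len(exact) >= min_pts and exact[min_pts - 1] <= d_f:
            break
        refined[obj] = exact_dist(obj)
        exact.append(refined[obj])
    exact.sort()
    if len(exact) >= min_pts and exact[min_pts - 1] <= eps:
        return exact[min_pts - 1], refined
    return None, refined   # Q is not a core object

# Values from the slide; exact distances of M, A and I are not given there,
# and the lookup would raise KeyError if the sketch ever asked for them.
filter_dists = [("K", 10), ("Z", 12), ("R", 18), ("M", 55), ("A", 58), ("I", 65)]
exact_to_q = {"K": 53, "Z": 69, "R": 53}
core, refined = core_distance(filter_dists, exact_to_q.__getitem__, eps=75, min_pts=3)
```

With the slide's numbers, only K, Z and R are refined. M, A and I keep their filter distances, because $d_f(M,Q)=55$ already exceeds the third-smallest exact distance, 53.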
Integrated Multi-Step Clustering Approach: Extended Seedlist (insertion of the result list of Q)
• The entries of Q's result list are inserted one after another, as far as the slide animation can be recovered:
    • $d_o(K,Q)=53$ is inserted into K's list; $d_f(K,C)=55$ no longer appears in that list afterwards.
    • $d_o(Z,Q)=69$ starts a new list for Z.
    • $d_o(R,Q)=53$ is appended to R's list.
    • $d_f(M,Q)=55$ becomes the new first element of M's list, ahead of $d_o(M,C)=65$.
    • $d_f(A,Q)=58$ starts a new list for A.
    • $d_f(I,Q)=65$ starts a new list for I.
• Extended seedlist after the insertion:
    • R: $d_f(R,B)=18$, $d_f(R,D)=34$, $d_o(R,Q)=53$
    • K: $d_f(K,B)=20$, $d_f(K,L)=30$, $d_f(K,G)=43$, $d_o(K,Q)=53$
    • M: $d_f(M,Q)=55$, $d_o(M,C)=65$
    • A: $d_f(A,Q)=58$
    • I: $d_f(I,Q)=65$
    • Z: $d_o(Z,Q)=69$
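A minimal sketch of the "list of lists" idea, using the values shown on the slides. Everything beyond the two ordering invariants is an assumption on our part: the class and method names are hypothetical, the `pop_next` step (refine the smallest entry if it is still a filter distance, return it once it is exact) is one plausible reading of "postpone exact distance calculations as long as possible", and the sketch does not prune entries from the lists.

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Entry:
    dist: float
    is_filter: bool                   # False sorts first: exact wins ties
    pred: str = field(compare=False)  # possible predecessor object

class ExtendedSeedlist:
    def __init__(self):
        self.lists = {}               # object -> ascending list of Entry

    def insert(self, obj, dist, pred, exact):
        entries = self.lists.setdefault(obj, [])
        entries.append(Entry(dist, not exact, pred))
        entries.sort()                # each predecessor list stays ascending

    def heads(self):
        """First elements of all lists, ascendingly ordered."""
        return sorted((es[0].dist, obj) for obj, es in self.lists.items())

    def pop_next(self, exact_dist):
        """Return (object, distance, predecessor) of the next query object,
        refining filter distances only when they reach the front."""
        while self.lists:
            obj = min(self.lists, key=lambda o: self.lists[o][0])
            head = self.lists[obj][0]
            if not head.is_filter:
                del self.lists[obj]
                return obj, head.dist, head.pred
            head.dist = exact_dist(obj, head.pred)   # expensive call
            head.is_filter = False
            self.lists[obj].sort()
        return None

seeds = ExtendedSeedlist()
for obj, dist, pred, exact in [
    ("R", 18, "B", False), ("R", 34, "D", False),
    ("K", 20, "B", False), ("K", 30, "L", False), ("K", 43, "G", False),
    ("K", 55, "C", False), ("M", 65, "C", True),
    # result list of the current query object Q
    ("K", 53, "Q", True), ("Z", 69, "Q", True), ("R", 53, "Q", True),
    ("M", 55, "Q", False), ("A", 58, "Q", False), ("I", 65, "Q", False),
]:
    seeds.insert(obj, dist, pred, exact)
heads_after_insert = seeds.heads()

calls = []
def exact_dist(obj, pred):
    calls.append((obj, pred))
    # exact values as they appear on the slides
    return {("R", "B"): 44, ("K", "B"): 25}[(obj, pred)]

next_object = seeds.pop_next(exact_dist)
```

In this run only two exact distances are computed, for (R, B) and (K, B). All other filter distances in the seedlist remain unrefined when K is returned as the next query object.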
Integrated Multi-Step Clustering Approach: Determination of Next Query Object
• Data structure "list of lists"
• Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
• Example, as far as the slide animation can be recovered:
    • The first element of the first list, $d_f(R,B)=18$, is a filter distance, so the exact distance $d_o(R,B)=44$ is computed.
    • R's list is re-sorted to $d_f(R,D)=34$, $d_o(R,B)=44$, $d_o(R,Q)=53$, and K's list, headed by $d_f(K,B)=20$, moves to the front.
    • $d_f(K,B)=20$ is then refined to the exact distance $d_o(K,B)=25$.
    • The remaining lists are unchanged: K continues with $d_f(K,L)=30$, $d_f(K,G)=43$, $d_o(K,Q)=53$; M: $d_f(M,Q)=55$, $d_o(M,C)=65$; A: $d_f(A,Q)=58$; I: $d_f(I,Q)=65$; Z: $d_o(Z,Q)=69$.

Outline
• Foundations of Density-Based Clustering: Core Object · Density-Reachability · DBSCAN · OPTICS
• Clustering of Complex Objects: Direct Integration of the Multi-Step Query Processing Paradigm
• Experimental Evaluation

Experimental Evaluation: Test Data Sets
• High-dimensional feature vectors representing CAD objects [DASFAA 03]
    • Not very selective filter used (Euclidean norm)
• Graphs representing images [DAWAK 03]
    • Expensive exact distance function
    • Selective filter used

Experimental Evaluation: DBSCAN
[Figure: runtime [sec.] (logarithmic axis) versus no. of objects (500, 1000, 2000, 3000); panels "Feature vectors" and "Graphs"; series: full table scan, traditional multi-step query processing, integrated multi-step query processing]
• Even non-selective filters (feature vectors) help to accelerate DBSCAN by up to an order of magnitude when the new integrated multi-step query processing approach is used.
• The traditional multi-step query processing approach does not benefit from non-selective filters (feature vectors), as the cardinality of the candidate set is still high even when small ε values are used.
• When filters of high selectivity (graphs) are used, our new integrated multi-step query processing approach leads to a speed-up of two orders of magnitude compared to a full table scan.

Experimental Evaluation: OPTICS
[Figure: runtime [sec.] (logarithmic axis) versus no. of objects (500, 1000, 2000, 3000); panels "Feature vectors" and "Graphs"; series: full table scan, traditional multi-step query processing, integrated multi-step query processing]
• When using filters of high selectivity (graphs), our new integrated multi-step query processing approach outperforms the traditional multi-step query processing approach and the full table scan by a factor of up to 30.
• For high ε values, as used with OPTICS, the full table scan performs even better than the traditional multi-step query processing approach.
Conclusions
Summary: "Efficient Density-Based Clustering of Complex Objects"
• Direct integration of the multi-step query processing paradigm into the clustering algorithm
• MinPts-nearest neighbor queries on the exact information
• Postponing expensive exact distance computations as long as possible
Future Work
• Integration of the multi-step query processing paradigm into other data mining algorithms