Cluster Description and Related Problems
DDM Lab Seminar, Spring 2005
Byron Gao

Road Map
1. Motivations
2. Cluster Description Problems
3. Related Problems in Machine Learning
4. Related Problems in Theory
5. Related Problems in Computational Geometry
6. Conclusions and Past / Future Work

1. Motivations: Clustering
Group objects into clusters such that objects within a cluster are similar and objects from different clusters are dissimilar
– one of the major data mining tasks
– K-means, hierarchical, density-based…

1. Motivations: Cluster Descriptions
The literature lacks a systematic study of cluster descriptions
– Most clustering algorithms just give membership assignments
– Hidden knowledge in data needs to be represented for inference
– Integration of database and data mining
Purposes of cluster descriptions
– Summarize data to gain initial knowledge
– Compress data to support further investigation
One way of describing a cluster: a DNF formula
– A set of isothetic hyper-rectangles (see the code sketch following Section 2)
– Interpretable
– Usable as a search condition to retrieve objects

1. Motivations: Summarize Data
Gain an initial, basic understanding of the clusters
– Capture shape and location in the multi-dimensional space
Requirement: interpretability
– Simple and clear structure of the description format, e.g., DNF, B-DNF…
  (B-DNF: the set difference between the bounding box B and a DNF formula describing the objects in B but not in cluster C)
– Short in length
Another requirement: accuracy
– Interpreting the correct target
– May trade accuracy for shorter length

1. Motivations: Compress Data (1)
Data compression has important applications in data transmission and data storage
– Transmission: distribute clustering results to branches (clustering is performed interactively by an expert at the head office)
– Storage: support further investigation (partial results, or results, which are in some sense always partial results)
Description process ≈ encoding (compression)
Retrieval process ≈ decoding (decompression)
– The description serves as the search condition in a SELECT query statement (see the code sketch following Section 2)

1. Motivations: Compress Data (2)
Requirement: compression ratio and retrieval efficiency
– Short in length
– comp. ratio = (|DESC| · 2) / |C|, typically < 0.01
Another requirement: accuracy
– Retrieve the desired objects
– May trade accuracy for shorter length: lossy compression / faster retrieval

1. Motivations: Compress Data (3)
Support further investigation
– Exploratory queries without retrieval of the original objects, e.g., …
– Exploratory queries requiring retrieval of the original objects (information not associated with the description, or not revealed by the description)
– Incremental clustering
– Interactive and iterative mining environment (inductive database framework)

2. Cluster Description Problems: SDL Problem
Shortest Description Length (SDL) problem: given a cluster C, its bounding box B, and a description format, obtain a logical expression DESC in the given format with minimum length such that for any object o, (o ∈ C → o ∈ DESC) and (o ∈ B − C → o ∉ DESC)
– Lossless
– NP-hard: a variant of the Minimum Set Cover problem

2. Cluster Description Problems: MDQ Problem
Maximum Description Quality (MDQ) problem: given a cluster C, its bounding box B, an integer l, a description format, and a quality measure, obtain a logical expression DESC in the given format with length at most l such that the quality measure is maximized
– Lossy
– NP-hard: the Set Cover problem reduces to it
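To make the DNF description format and its use as a SELECT search condition concrete, here is a minimal Python sketch. All names (the points relation, columns x1/x2, and the helper functions) are illustrative assumptions, not from the talk.

```python
# Minimal sketch of a DNF cluster description as a set of isothetic
# (axis-parallel) rectangles, with its membership test and its rendering
# as a SELECT search condition. Names are illustrative assumptions:
# a numeric relation points(x1, x2) and Python tuples for objects.

def in_rectangle(o, rect):
    """True iff object o (a coordinate tuple) lies inside the rectangle."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(o, rect))

def described(o, desc):
    """DNF semantics: o satisfies the description iff some rectangle covers it."""
    return any(in_rectangle(o, rect) for rect in desc)

def to_sql_condition(desc, cols):
    """Render the description as a WHERE clause, one disjunct per rectangle."""
    disjuncts = []
    for rect in desc:
        conjunct = " AND ".join(
            f"{c} BETWEEN {lo} AND {hi}" for c, (lo, hi) in zip(cols, rect))
        disjuncts.append(f"({conjunct})")
    return " OR ".join(disjuncts)

desc = [[(0, 2), (0, 3)], [(4, 6), (1, 2)]]  # two rectangles in 2-d
print(described((1, 1), desc))               # True
print("SELECT * FROM points WHERE " + to_sql_condition(desc, ["x1", "x2"]))
```

The generated WHERE clause is the sense in which one description doubles as both a summary and a retrieval query.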
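Continuing in the same hypothetical setting, the lossless SDL condition and the compression ratio from Section 1 can be checked directly; is_lossless and compression_ratio are assumed helper names.

```python
# Check the lossless SDL condition: every object of C satisfies DESC and
# no object of B - C does. The described() helper is repeated from the
# previous sketch so this snippet is self-contained.

def described(o, desc):
    return any(all(lo <= v <= hi for v, (lo, hi) in zip(o, r)) for r in desc)

def is_lossless(C, B, desc):
    C = set(C)
    return (all(described(o, desc) for o in C) and
            not any(described(o, desc) for o in B if o not in C))

def compression_ratio(C, desc):
    # each rectangle is stored as 2 corner points, hence |DESC| * 2
    return (len(desc) * 2) / len(C)

C = [(1, 1), (2, 2), (5, 1)]                 # cluster objects
B = C + [(3, 0), (3, 3)]                     # all objects in the bounding box
desc = [[(0, 2), (0, 3)], [(4, 6), (1, 2)]]  # two isothetic rectangles
print(is_lossless(C, B, desc))               # True: covers C, misses B - C
print(compression_ratio(C, desc))            # 1.33..., a toy value, not < 0.01
```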
2. Cluster Description Problems: MDQ Problem: Quality Measure
F-measure
– F = 2PR / (P + R): the harmonic mean of P and R
– R = |C_described| / |C|
– P = |C_described| / |B_described|
– where C_described (B_described) is the set of objects in C (B) described by the description
– (a numeric sketch of these measures follows Section 3 below)
RP=1 (Recall at fixed Precision of 1)
PR=1 (Precision at fixed Recall of 1)

3. Related Problems in Machine Learning: Concept Learning
Concept learning: find a classifier for a concept
– Given a labeled training sequence, generate a hypothesis that is a good approximation of the target concept and can be used as a predictor for other query points
– A typical machine learning problem
– The most extensively studied problem in computational learning theory
Representation of the target concept is assumed known
– DNF formulas, Boolean circuits, geometric objects, pattern languages, deterministic finite automata, etc.

3. Related Problems in Machine Learning: Geometric Concept Learning
Target is a geometric object
– A popular topic in computational learning theory
Half spaces, separated by linear hyperplanes
– perceptron learning
Intersections of half spaces: convex polyhedra
– in particular, axis-parallel boxes
Unions of axis-parallel boxes
– More accurate, yet simple enough to offer intuitive solutions
– e.g., 2 boxes; less studied

3. Related Problems in Machine Learning: Learning Models
PAC learning model: Probably Approximately Correct [15]
– Given a reasonable number of labeled instances, generate with high probability a hypothesis that is a good approximation of the target concept, within a reasonable amount of time
– (a minimal sketch of the classic PAC learner for a single axis-parallel box follows this section)
Exact learning model [2]
– The learner is allowed to pose queries in order to exactly identify the target concept within a reasonable amount of time
– Common queries, answered by an oracle:
  Membership query: feed an instance to the oracle, which returns its classification
  Equivalence query: present a hypothesis to the oracle, which returns "correct" or a counterexample

3. Related Problems in Machine Learning: Similarities
Learn unions of axis-parallel boxes
– a DNF formula
The learner is fed positive and negative examples, which need to be classified accurately

3. Related Problems in Machine Learning: Dissimilarities (1): Objective
Geometric concept learning
– Exact or approximate identification of a geometric object
– Labeled instances are a small part of the instance space
– The ultimate goal is classification accuracy over the instance space
– Optimization: the least-disagreement problem, related to geometric object construction in computational geometry
– Fixed number of boxes, usually 1 or 2
– "+" and "–" examples are treated equally
Cluster description problems
– The entire instance space is given
– Describe the "+" examples instead of classifying "+" and "–" examples
– For the quality measure, "+" and "–" examples are not treated equally
– Examples could be labeled "1", "5", "9"… for clustering description
– Optimization problems: the SDL and MDQ problems

3. Related Problems in Machine Learning: Dissimilarities (2): Research Focus
Geometric concept learning
– Learnability under certain learning models
– Learning from queries (membership, equivalence) rather than from random examples, to significantly improve performance
– Low-dimensional
Cluster description problems
– Heuristics
– Large data sets, possibly high-dimensional
– Other concerns: compression tool, retrieval efficiency, etc.
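To make the quality measures of Section 2 concrete, here is a small numeric sketch of P, R, and the F-measure; the function and variable names are illustrative assumptions.

```python
# Numeric sketch of the description-quality measures from the slides.
# C_described (B_described): objects of C (of B) satisfying the description.
# Assumes the description covers at least one object of C.

def quality(C, B, described_set):
    C, B, D = set(C), set(B), set(described_set)
    C_described = C & D          # described objects that belong to the cluster
    B_described = B & D          # all described objects in the bounding box
    R = len(C_described) / len(C)             # recall
    P = len(C_described) / len(B_described)   # precision
    F = 2 * P * R / (P + R)                   # harmonic mean of P and R
    return P, R, F

# 10-object cluster; the description covers 9 of them plus 1 outside object:
C = {f"c{i}" for i in range(10)}
B = C | {"o1", "o2"}
D = {f"c{i}" for i in range(9)} | {"o1"}
print(quality(C, B, D))   # P = 0.9, R = 0.9, F = 0.9
```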
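As a tiny concrete instance of the geometric concept learning surveyed above, the textbook PAC learner for a single axis-parallel box simply outputs the tightest box around the positive examples. A minimal sketch, illustrative rather than from the talk:

```python
# Classic PAC learner for one axis-parallel box: hypothesize the tightest
# box containing all positive examples. With enough random labeled examples
# this is, with high probability, a good approximation of the target box.

def tightest_box(positives):
    """positives: list of d-dimensional points labeled '+'."""
    dims = zip(*positives)                  # transpose to per-dimension values
    return [(min(vals), max(vals)) for vals in dims]

def predict(o, box):
    """Classify o as '+' iff it lies inside the hypothesized box."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(o, box))

box = tightest_box([(1, 2), (3, 1), (2, 4)])
print(box)                    # [(1, 3), (1, 4)]
print(predict((2, 3), box))   # True
```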
4. Related Problems in Theory: Two Traditional Problems
Minimum Set Cover problem [10]
– Given a collection S of subsets of a finite ground set X, find S' ⊆ S with minimum cardinality such that ⋃_{Si ∈ S'} Si = X
– Approximable within ln |X| [8]: greedy set cover, which iteratively picks the subset covering the maximum number of uncovered elements (sketched after Section 5 below)
Maximum Coverage problem (max k-cover): a variant
– Maximize the number of covered elements in X using at most k subsets from S
– Approximable within e / (e − 1) by the same greedy algorithm [8]
The ratios are optimal [6]

4. Related Problems in Theory: Red-Blue Set Cover Problem
Given a finite set of "red" elements R, a finite set of "blue" elements B, and a family S ⊆ 2^(R ∪ B), find a subfamily S' ⊆ S which covers all blue elements and a minimum number of red elements
– Posed recently (2000) [4]
– Related to the Group Steiner tree, minimum monotone satisfying assignment, and minimum color path problems
– The Set Cover problem is reducible to it, so it is at least as hard as Set Cover [4]

4. Related Problems in Theory: SDL Problem is NP-hard
A variant of the Minimum Set Cover problem
– Ground set X = the set of objects in C, the cluster being described
– Collection S of rectangles (subsets): each Sj ∈ S is a rectangle minimally covering some objects in C and no objects outside C, so S ⊆ 2^C
– Find S' ⊆ S with minimum cardinality such that ⋃_{Si ∈ S'} Si = C

4. Related Problems in Theory: MDQ Problem is NP-hard
RP=1: a variant of the Maximum Coverage problem
– constructed in the same fashion as on the previous slide
– given k, find S' ⊆ S with |S'| at most k such that the number of covered elements in C is maximized
PR=1: a variant of the Red-Blue Set Cover problem
F-measure: either of the two problems above reduces to it

4. Related Problems in Theory: Similarities and Dissimilarities
The Set Cover problem and its variants arise in a variety of settings
Useful for arguing the NP-hardness of the SDL and MDQ problems
Not useful from the algorithm-design point of view, because of an essential difference: in the SDL and MDQ problems, the collection of subsets S is not explicitly given
– e.g., the greedy set cover algorithm cannot be applied directly

5. Related Problems in Computational Geometry: Maximum Box Problem
Given two finite sets of points X+ and X− in ℝ^d, find an axis-parallel box B such that B ∩ X− = ∅ and the total number of points from X+ covered is maximized
– Posed recently (2002) [5]
– NP-hard when d is part of the input
– Studied for d = 2 by [12] [14], which find exact solutions; techniques for low-dimensional and small data sets (a brute-force sketch for d = 2 follows this section)
– Corresponds to the MDQ problem with k = 1 and RP=1

5. Related Problems in Computational Geometry: Rectilinear Polygon Covering Problem
Find a collection of axis-parallel rectangles with minimum cardinality whose union is exactly a given rectilinear polygon
– Covering rectilinear polygons, possibly with holes, with axis-parallel rectangles [7] (1979)
– Different from partitioning: overlapping is allowed
In 2-d
– NP-complete [13]
– No polynomial-time approximation scheme is possible [3]
A special case of the general set covering problem
– performance guarantee O(log n)
– recent result: O(√log n) [9]

5. Related Problems in Computational Geometry: Similarities and Dissimilarities
If extended to higher-dimensional space, exactly the same formulation as in Greedy Growth [1]
If "don't care" cells are further allowed, a formulation similar to Algorithm BP [11]
If not restricted to grid data, the same as the SDL problem
[1] and [11] are related studies in the DB literature
Logic minimization is also related to the grid case
– most research there is on Boolean variables
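The greedy strategies named in Section 4 are simple to state. The following is a minimal sketch, illustrative code rather than the talk's algorithms, of greedy set cover and greedy max k-cover over an explicitly given collection S, which is exactly the ingredient the SDL/MDQ setting lacks.

```python
# Greedy set cover and greedy max k-cover, as discussed in Section 4.
# Both repeatedly pick the subset covering the most still-uncovered elements.

def greedy_set_cover(X, S):
    """Cover all of X; achieves the ln-factor guarantee of [8]."""
    uncovered, cover = set(X), []
    while uncovered:
        best = max(S, key=lambda s: len(uncovered & set(s)))
        if not uncovered & set(best):
            raise ValueError("S does not cover X")
        cover.append(best)
        uncovered -= set(best)
    return cover

def greedy_max_k_cover(X, S, k):
    """Use at most k subsets; achieves the e/(e-1) guarantee of [8]."""
    uncovered, chosen = set(X), []
    for _ in range(min(k, len(S))):
        best = max(S, key=lambda s: len(uncovered & set(s)))
        chosen.append(best)
        uncovered -= set(best)
    return chosen, len(set(X)) - len(uncovered)

X = {1, 2, 3, 4, 5}
S = [{1, 2, 3}, {2, 4}, {4, 5}, {5}]
print(greedy_set_cover(X, S))        # [{1, 2, 3}, {4, 5}]
print(greedy_max_k_cover(X, S, 1))   # ([{1, 2, 3}], 3)
```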
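For intuition about the Maximum Box problem, here is a brute-force sketch for d = 2; it is illustrative only, not the exact algorithms of [12] [14]. It relies on the observation that an optimal box can be shrunk until each side touches a positive point, so candidate bounds can be taken from positive coordinates.

```python
from itertools import combinations_with_replacement

# Brute-force Maximum Box in the plane: find an axis-parallel box avoiding
# all negative points while covering as many positive points as possible.

def max_box_2d(pos, neg):
    xs = sorted({p[0] for p in pos})   # candidate x-bounds from positives
    ys = sorted({p[1] for p in pos})   # candidate y-bounds from positives

    def inside(p, b):
        return b[0] <= p[0] <= b[1] and b[2] <= p[1] <= b[3]

    best, best_count = None, 0
    for x1, x2 in combinations_with_replacement(xs, 2):
        for y1, y2 in combinations_with_replacement(ys, 2):
            box = (x1, x2, y1, y2)
            if any(inside(q, box) for q in neg):
                continue                       # box must avoid all negatives
            count = sum(inside(p, box) for p in pos)
            if count > best_count:
                best, best_count = box, count
    return best, best_count

pos = [(0, 0), (1, 1), (2, 2), (3, 0)]
neg = [(2, 0)]
print(max_box_2d(pos, neg))   # ((0, 1, 0, 1), 2): covers (0,0) and (1,1)
```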
6. Conclusions and Past / Future Work: Conclusions
Cluster (clustering) description is of fundamental importance to KDD, but it has not been systematically studied
The formulated problems differ significantly from related problems in the machine learning, theory, computational geometry, and DB literatures

6. Conclusions and Past / Future Work: Past Work
Formulated the SDL and MDQ problems and provided heuristic algorithms for them
– for both the DNF and B-DNF formats
– working for both vector and grid data
Studied the efficient-retrieval issue
– cost model
– efficiency estimation methodology
– optimal ordering of description terms

6. Conclusions and Past / Future Work: Future Work
Optimization of the algorithms: the focus so far has been on introducing concepts and ideas
Experiments on real data
Experimental validation of the retrieval-efficiency issue on a DBMS
Description alphabet: always rectangles?
– if not, the problem moves farther away from some of the related problems
Clustering description: gaining quality and efficiency
– even more distinguishable from the related problems
Incremental description: the description process could take long
Interaction with clustering algorithms: online description?

References
1. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD 1998.
2. D. Angluin. Queries and concept learning. Machine Learning, 2(4): 319-342, 1988.
3. P. Berman and B. DasGupta. Approximating rectilinear polygon cover problems. Algorithmica, 17: 331-356, 1997.
4. R.D. Carr, S. Doddi, G. Konjevod, and M. Marathe. On the red-blue set cover problem. SODA 2000.
5. J. Eckstein, P. Hammer, Y. Liu, M. Nediak, and B. Simeone. The maximum box problem and its application to data analysis. Computational Optimization and Applications, 23(3): 285-298, 2002.
6. U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4): 634-652, 1998.
7. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York, 1979.
8. D.S. Hochbaum. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In Approximation Algorithms for NP-hard Problems. PWS, New York, 1997.
9. V.S.A. Kumar and H. Ramesh. Covering rectilinear polygons with axis-parallel rectangles. SIAM J. Comput., 32(6): 1509-1541, 2003.
10. D.S. Johnson. Approximation algorithms for combinatorial problems. J. Comput. System Sci., 9: 256-278, 1974.
11. L.V.S. Lakshmanan, R.T. Ng, C.X. Wang, X. Zhou, and T.J. Johnson. The generalized MDL approach for summarization. VLDB 2002.
12. Y. Liu and M. Nediak. Planar case of the maximum box and related problems. CCCG 2003.
13. W.J. Masek. Some NP-complete set covering problems. Manuscript, MIT, Cambridge, MA, 1979.
14. M. Segal. Planar maximum box problem. Journal of Mathematical Modelling and Algorithms, 3: 31-38, 2004.
15. L.G. Valiant. A theory of the learnable. Commun. ACM, 27(11): 1134-1142, 1984.

Questions?