* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Database Clustering and Summary Generation
Survey
Document related concepts
Transcript
Database Clustering and Summary Generation Tae-Wan Ryu and Christoph F. Eick Similarity Measures For Multi-valued Attributes for Database Clustering Tae-wan Ryu and Christoph F. Eick Department of Computer Science University of Houston Talk Organization Database Clustering Problems of Database Clustering Extended Data Sets Similarity Measures for Sets and Bags An Architecture for Database Clustering Summary and Conclusion General KDD Steps Data sources Selected/Preprocessed data Select/preprocess Transform Data preparation Transformed data Extracted information Knowledge Data mine Interpret/Evaluate/Assimilate Research Goal To develop methodologies, techniques, and tools to create summaries from databases using cluster analysis and genetic programming Our approach Partition the database into groups of similar objects using cluster analysis Find commonalities that objects belonging to each group share using genetic programming Database Summary Generation Steps and Example < Steps > < Example > Database Restaurant database Database Clustering Clusters Groups of similar objects Young White color Retired Summary Generation Midnight Summaries describing the commonalities within each group Dinner Lunch An Example Schema Diagram Marriage hssn wssn mdate numkid husband, 1:n wife, 1:n ehusband, n:1 ewife, n:1 Employee name ssn Department superv, n:1 address sex salary superssn dno dnum works_for, n:1 Works_on essn works_on, 1:n pno Project hours pname control, 1:n pnum project, n:1 ploc dnum dname works_loc, 1:n Dept_loc dnum dloc Preprocessing for Database Clustering Preparing input data sets for clustering Appropriate data selection and preparation from a database is important task Key Problems How to support a user’s viewpoint including attribute selection Data model discrepancy between storage format and the input format that clustering algorithms assume How to cope with structural information, especially 1:n and n:m relationships Input Format for Data Mining Algorithms Data Format for Input Data Sets Single flat file format (basically, the data set has to be stored as a single(!) relation) Complex and structured formats Problem: Almost all existing data mining and clustering approaches assume that input data set is in single flat file format. An Example Database to Illustrate the Problems with Relationship Information in Database Clustering Person ssn name age sex 111111111 Johny 43 M 222222222 Andy 21 F 333333333 Post 67 M 444444444 Jenny 35 F Purchase ssn location ptype amount date 111111111 Warehouse 1 400 02-10-96 111111111 Grocery 2 70 05-14-96 111111111 Mall 3 200 12-24-96 222222222 Mall 2 300 12-23-96 222222222 Grocery 3 100 06-22-96 333333333 Mall 1 30 11-05-96 (a) Joined result name age Johny 43 Johny 43 Johny 43 Andy 42 Andy 42 Post 67 Jenny 35 sex ptype amount location M 1 400 Mall M 2 70 Grocery M 3 200 Warehouse F 2 300 Mall F 3 100 Grocery M 1 30 Mall F null null null (b) ptype (payment type): 1 for cash, 2 for credit, and 3 for check, the cardinality ratio is 1:n (a) an example of Personal relational database, (b) a joined table from Person and Purchase relations Existing Approaches Applying aggregate functions or generalization operators to convert a multi-valued attribute into a single valued attribute. Problems User has to make a critical decision (e.g., which aggregate function to use?) Valuable related information may be lost. Extended Data Sets name age sex ptype amount location Johny 43 M 1 400 Mall Johny 43 M 2 70 Grocery Johny 43 M 3 200 Warehouse Andy 42 F 2 100 Mall Andy 42 F 3 100 Grocery Post 67 M 1 30 Mall Jenny 35 F null null null name age Johny 43 Andy 21 Post 67 Jenny 35 sex M F M F p.ptype p.amount {1,2,3} {400,70,200} {2,3} {100,100} 1 30 null null p.location {Mall, Grocery, Warehouse} {Mall, Grocery} Mall null A converted table with a bag of values How to measure similarity between bags of values? Group similarity measures are needed. Approaches for Database Clustering Structured database Manual transformation Flat file Clustering algorithms <Current approach> Structured database Automated preprocessing Extended data set <Proposed approach> Generalized Clustering algorithms Related Work LABYRINTH (Thompson et al.) Ketterlin’s extended COBWEB KATE (Manago et al.) SUBDUE (Holder et al.) INLEN (Ribeiro et al.) KBG (Bisson et al.), KLUSTER (Kietz et al.) Research Objectives for Database Clustering To alleviate the representational gab between databases on the one hand and input formats of clustering algorithms on the other hand To design and implement semi-automatic tools to facilitate database clustering To generalize clustering algorithms Generating Extended Data Sets From a Structured Database Database d1, d2, …, dn Extended data set generator Extended data set1 User’s interests and objectives A Unified Similarity Measure for Clustering Extended Data Sets Group Similarity Measures Mixed Types: qualitative, quantitative types. Qualitative type: Tversky’s set-theoretical similarity models. Contrast model S(a,b) = f(AB) f(A B) f(B A), where a and b be two objects, and A and B denote the sets of features for some , , 0; f is the cardinality of the set Ratio model (e.g., normalized similarity) S(a,b) = f(AB) / [f(AB) + f(A B) + f(B A)], , 0 Group Similarity Measures... continued Quantitative type: group average Group average between group A and B n d ( A, B ) d(a,b) i n , i 1 where n is the total number of object-pairs, d(a,b)i is the dissimilarity measure for the ith pair of objects a and b, a A, b B. By taking the average of all the inter-object measures for those pairs of objects from which each object of a pair is in different groups. A Framework for Mixed Type Similarity Measures for Extended Data Sets Gower’s similarity measure for data sets with mixed-types. m m S (a, b) wi si (ai , bi ) / wi i 1 i 1 Extended similarity measure for multi-valued data sets with mixed-types. q q l l S (a, b) [ wi sl (ai , bi ) w j sq (a j , b j )]/ ( wi w j ) i 1 j 1 i 1 j 1 where m = l + q. The functions, sl(a,b) and sq(a,b) are similarity functions for qualitative attributes and quantitative attributes respectively. Clustering Algorithms for Extended Data Sets Nearest-neighbor clustering DBSCAN Leader algorithm Hierarchical clustering Database Clustering Environment Library of clustering algorithms Extended Data set Data Extraction Tool DBMS Clustering Tool User Interface A set of clusters Similarity measure Similarity Measure Tool Default choice and domain information Library of similarity measures Type and weight information A More Detailed Tool Architecture Other Data mining tools Flat file Extended data set Processed data Preprocessor Query result DBMS Query Our data mining tools Form translator User's interests and objectives Join form Database name Relationship definitions Data set of interest Selected attributes Other information A Join Template Form A Join Template Form Begin-spec Database-name: DB; Link-definitions: Link-list; Begin-join Dataset-of-interest: Dsetintrest; Selected-attributes: Attr-list; Objective-attributes: Obj-attr-list; Extended-data-set: E; End-join End-spec An Example of the Interface of the Extended Data Set Generation Tool Begin-spec DB-name: Company Link-definitions: superv(Employee.ssn, Employee.superssn), husband(Employee.ssn, Marriage.hssn), wife(Employee.ssn,Marriage.wssn), ehusband(Marriage.hssn, Employee.ssn), ewife(Marriage.wssn, Employee.ssn), works_on(Employee.ssn, Works_on.essn), project(Works_on.pno, Project.pnum), works_for(Employee.dno, Department.dnum), works_loc(Department.dnum, Dept_loc.dnum) Begin-join Dateset-of-interest: Employee Selected-attributes: ssn, sex, salary, superv.salary, wife.ewife.salary, works_on.hours, works_on.project.pname, works_for.works_loc.dloc Objective-attributes: ssn Output-data-set: E1 End-join End-spec Algorithm to Generate Extended Data Sets Project the Data Set of Interest by Primary key and Selected Attributes Join the Data set of Interest and related data sets to get all related attributes for each join-path Group attributes together that describe the same object Summary Representation Our approach uses database queries as our summary representation language. Queries that compute the objects belonging to a cluster and no other objects are considered to be perfect summaries for a cluster. An example query for a cluster (SELECT ssn name address FROM person purchase WHERE (amount-spent > 1000) and (payment-type = ‘cash’)and (store-name = ‘flea-market’)) “Typically, members in the cluster have spent more than $1,000 cash for shopping in a flea-market” Summary and Contributions Discussed the data model discrepancy between database storage format and input data format for traditional clustering algorithms Discussed the problems of dealing with relationship information in database clustering Presented a different way of representing related information using extended data sets Introduced the design and architecture of an automatic tools to generate extended data sets from databases Generalized the traditional similarity measures and present a framework to cope with extended data sets in similarity-based clustering Architecture of MASSON cluster g1 g2 Object set Schema information user input user interface ... gk Clustering module system input GP based discovery system generate apply DB select user input KB Domain knowledge Query set Interface GP engine Query result evaluate system input Discovered query set return DBMS Evolution Process Initial generation Initial population Generationn generation2 evolve q11, q12,..,q1m evolved population evolve qn1, qn2,..,qnm q21, q22,..,q2m selection crossover mutation evolved population selection crossover mutation n: number of generation m: the size of population selection Solution Q Evolution Process Initial generation Initial population Generationn generation2 evolve q11, q12,..,q1m evolved population evolve qn1, qn2,..,qnm q21, q22,..,q2m selection crossover mutation evolved population selection crossover mutation n: number of generation m: the size of population selection Solution Q