Pattern Management: Models, Languages, and Architectural Issues
Tutorial, DASFAA’05 - Beijing, April 16th
Barbara Catania, DISI - University of Genoa

Tutorial Objectives
• Provide a definition of pattern management
• Identify the environments in which pattern management could be useful
• Understand the analogies and the differences with data mining, data warehousing and metadata management
• Introduce the main requirements of pattern management
• Present some theoretical proposals and some standards, and discuss their features with respect to the general pattern management requirements
• Discuss open issues

Outline
• Introduction to pattern management
• Features
  – Architecture
  – Models
  – Languages
• Theoretical proposals
• Standards
• Open issues

A lot of data but little information ...
The world produces between 1 and 2 exabytes of unique information per year, which is roughly 250 megabytes for every man, woman, and child on earth (Lyman & Varian, 2003)

Which data?
• Large datasets
• Distributed sources
• Heterogeneous
• Data is not knowledge!

Which information?
• Knowledge artifacts
• Smaller datasets, manageable by humans
• Preserve as much as possible of the hidden, interesting, or available information in the data
[Figure: decision tree for contact lenses. Node labels, translated from Italian: tear production (reduced, normal), astigmatism (no, yes), eye prescription (myopia, hypermetropia); leaf labels: none, soft, rigid]

We probably need patterns ...
• A compact and semantically rich representation of raw data
[Figure: data mapped to patterns]

An example
T1: Beer, Potato Chips, Refreshments, Nappies
T2: Whisky, Beer, Nappies
T3: Detergents, Broom, Beer, Potato Chips
T4: Milk, Potato Chips, Tomatoes, Carrots
T5: Cigarettes, Meat, Refreshments
T6: Meat, Cheese, Fish, Refreshments, Beer, Potato Chips
A1: Beer ⇒ Potato Chips

Pattern examples
Not all patterns are data mining patterns!

Should heterogeneous patterns be managed together?
• Example 1
  – Which items are co-purchased with a certain promotional item p?
    • Frequent itemsets
  – What are the circumstances (e.g., location, time, etc.) under which the frequent co-purchases were made?
    • Frequent itemsets and decision trees

Should heterogeneous patterns be managed together?
• Example 2
  – Mobile objects monitoring through trajectories
    • Equations
  – What objects are similar with respect to their trajectories?
    • Clusters over equations

Should patterns be combined?
• Example 3
  – Classifying customers in China into the categories of highRisk and lowRisk for credit rating
    • Decision tree T1
  – Predicting under which conditions people in China live in cities vs. the countryside
    • Decision tree T2
  – How can T1 and T2 be combined in order to predict under which conditions people have a certain credit rating and tend to live in a certain neighborhood?
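
The co-purchase question of Example 1 can be made concrete on the six transactions of the running example. The following is a minimal sketch, not part of the tutorial: the function names `co_purchased` and `frequent_pairs` and the threshold parameter `min_count` are illustrative assumptions, and T1 is read as containing the single item Potato Chips, as in the later copies of the same example.

```python
from collections import Counter

# Transactions T1..T6 from the running example, each as a set of items.
TRANSACTIONS = {
    "T1": {"Beer", "Potato Chips", "Refreshments", "Nappies"},
    "T2": {"Whisky", "Beer", "Nappies"},
    "T3": {"Detergents", "Broom", "Beer", "Potato Chips"},
    "T4": {"Milk", "Potato Chips", "Tomatoes", "Carrots"},
    "T5": {"Cigarettes", "Meat", "Refreshments"},
    "T6": {"Meat", "Cheese", "Fish", "Refreshments", "Beer", "Potato Chips"},
}


def co_purchased(item, transactions=TRANSACTIONS):
    """Count how often every other item appears together with `item`."""
    counts = Counter()
    for items in transactions.values():
        if item in items:
            counts.update(items - {item})
    return counts


def frequent_pairs(item, min_count, transactions=TRANSACTIONS):
    """Two-item itemsets containing `item` found in at least `min_count` transactions.

    `min_count` is a hypothetical absolute threshold chosen by the caller.
    """
    return {
        frozenset({item, other})
        for other, n in co_purchased(item, transactions).items()
        if n >= min_count
    }


if __name__ == "__main__":
    # Items bought together with the promotional item "Beer", most frequent first.
    for other, n in co_purchased("Beer").most_common():
        print(other, n)
```

Only pairs are enumerated here; a real frequent itemset miner would also explore larger itemsets.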
A many-to-many relationship
[Figure: many-to-many links between the data space and the pattern space]

An example
T1: Beer, Potato Chips, Refreshments, Nappies
T2: Whisky, Beer, Nappies
T3: Detergents, Broom, Beer, Potato Chips
T4: Milk, Potato Chips, Tomatoes, Carrots
T5: Cigarettes, Meat, Refreshments
T6: Meat, Cheese, Fish, Refreshments, Beer, Potato Chips
A1: Beer ⇒ Potato Chips
A2: Beer ⇒ Refreshments

But … how can the spaces be characterized?
1. The pattern space must provide the representation of heterogeneous patterns
2. Are the two spaces distinct, or do they coincide?
   • Patterns as a kind of data
3. Is the pattern space defined in terms of the data space?
   • Patterns as views over data

… how is pattern importance determined?
[Figure: data space and pattern space, with two patterns p1 and p2 compared by importance]
• How is it possible to quantify the data representation realized by a pattern?
• Need for measures

An example
Transactions T1 to T6 as in the previous example.
A1: Beer ⇒ Potato Chips, Support(A1) = 3, Confidence(A1) = 0.75
A2: Beer ⇒ Refreshments, Support(A2) = 2, Confidence(A2) = 0.5
S(X) = number of transactions containing X
Support(X ⇒ Y) = S(X ∪ Y)
Confidence(X ⇒ Y) = S(X ∪ Y) / S(X)
For A1, S({Beer}) = 4 (T1, T2, T3, T6) and S({Beer, Potato Chips}) = 3 (T1, T3, T6), hence Confidence(A1) = 3/4 = 0.75.

… how is data importance determined?
[Figure: data space and pattern space, with two data items d1 and d2 such that d1 > d2]
• How many patterns does a data item (or a subset of data items) correspond to?

… how is the relationship represented?
1. Set of data items from which a pattern has been extracted (in case of mining)
2. Set of data items represented by a pattern
3. Set of data items possibly represented by a pattern

… how is relationship evolution traced?
• The data space changes with a high frequency
• What happens to the pattern space?
• … patterns are not necessarily views over raw data …
• Need for synchronization

Spaces interaction: a generic usage scenario
[Figure: data space and pattern space, with arrows numbered 1 to 9 for the steps below]
1. Extraction
2. Select one or more patterns (PS1)
3. What are the data represented by such patterns? (DS1)
4. Select a certain subset of DS1 (DS2)
5. What are the patterns representing, possibly approximately, DS2? (PS2)
6. Is a pattern in PS2 suitable for representing a dataset DS3?
7. How different are two pattern sets? Is the difference significant?
8. How similar are two patterns?
9. How can patterns be combined together?

Pattern management
• Representation of data and pattern spaces
• Generation or definition of patterns
• Storage
• Retrieval
• Synchronization
• Visualization
• Analysis, inference
• …

Pattern management vs data mining
• Data mining
  – Generation of new and previously unknown knowledge from large datasets
• Pattern management
  – Generation and management of (heterogeneous) patterns
Data mining is an activity of pattern management
[Figure: data, data mining techniques, knowledge artifacts (patterns), pattern base; pattern management spans the whole chain]

Pattern management vs data warehousing
• Is a pattern-base a sort of a meta-warehouse?
  – Relationships between the data warehouse and the source data exist in the form of metadata
  – Measures are a key concept
• However
  – Patterns cannot necessarily be represented in terms of dimensions and measures
  – The DW model is not adequate (and therefore neither are its languages)
[Figure: data warehouse, data mining techniques, knowledge artifacts (patterns), pattern base, pattern management]

Pattern management vs metadata management
• Patterns are metadata
  – Semantic metadata
  – Knowledge over data
• Metadata are not patterns
  – No quantification of importance
  – Not necessarily first-class citizens
[Figure: data warehouse with its metadata repository, data mining techniques, knowledge artifacts (patterns), pattern base, pattern management]

We need a Pattern Base Management System ...
• Pattern-Base Management System (PBMS): technology for
  – Modeling patterns as first-class citizens
  – Querying patterns
  – Efficiently managing patterns
  – Uniformly managing heterogeneous patterns

Is DBMS technology sufficient?
• Patterns can be represented according to some advanced data model
  – Object-oriented
  – Semi-structured
  – …
• … but specific applications are required to cope with patterns
[Figure: a pattern management layer on top of a DBMS that holds a data & pattern base]

Is DBMS technology sufficient?
• Alternatively, we can use DBMS technology to design an ad hoc PBMS
[Figure: a DBMS managing the data base, alongside a PBMS managing the pattern base]

Disciplines involved
[Figure: pattern management surrounded by the disciplines involved: artificial intelligence, DBMS technology, data mining, constraint programming, data warehousing, spatial databases, metadata management, …]

Outline
• Introduction to pattern management
• Features
  – Architectures
  – Models
  – Languages
• A classification of existing proposals
• Theoretical proposals
• Standards
• Open issues

PBMS features
• A PBMS architecture
• A pattern model
• Languages to query and manipulate patterns
[Figure: data warehouse, data mining techniques, knowledge artifacts (patterns), pattern base, pattern management]

Integrated architecture
[Figure: queries are sent to a pattern management layer on top of a DBMS that holds a data & pattern base]

Integrated architecture
• One management system
• One logical model
  – Less effort in design
  – Limitations due to the chosen model
• Pattern generation is a query operator
  – Need to extend traditional query languages
  – Mixing manipulation and query operations
• Only one language for pattern and data querying
  – Less effort in query design
  – What about optimization?
Separated architecture
[Figure: a DBMS managing the data base answers data queries; a PBMS managing the pattern base answers pattern queries; pattern extraction goes from the data base to the pattern base; cross-over queries involve both]

Separated architecture
• Two management systems
• Two logical models
  – More effort in design
• Pattern generation is a pattern manipulation operation
  – Clear distinction between query and manipulation operations
• Two groups of query languages
  – More effort in query design
  – Ad hoc optimization techniques

The model
• Support of typical (data mining) patterns
• Support for user-defined pattern types
• Hierarchies over pattern types
• Relation between raw data and patterns
• Quality measures

User-defined pattern types support
• Typical data mining patterns (association rules, clusters, decision trees, etc.) are usually supported but often independently managed
• The pattern space should be extensible, to guarantee the representation of user-defined pattern types

Pattern hierarchies
• Ability to model hierarchies between pattern types
  – expressivity
  – reusability
  – modularity

Relationship between raw data and patterns
• Storage of the relation between patterns and raw data
• Makes the pattern richer in semantics and provides significant information for pattern retrieval
• Three different approaches
  1. Set of data items from which the pattern has been extracted (in case of mining)
  2. Set of data items represented by a pattern
  3. Set of data items possibly represented by a pattern

Quality Measures
• Patterns are usually associated with measures
  – Association rules: support, confidence, J-measure, conviction (Smith, Goodman, 1992)
  – Clusters: average intra-cluster distance
• In general measures are static
  – Computed at pattern extraction time
  – New computation = new pattern extraction
Languages
• Pattern manipulation language
  – Automatic extraction
  – Direct insertion of patterns
  – Modifications and deletions
  – Synchronization over source data
  – Mining function
• Pattern query language
  – Queries against patterns
    • Similarity
    • Combination
  – Queries involving source data

PML: automatic extraction
• Capability of a system to generate patterns starting from raw data using a mining function
• It corresponds to the data mining step of a knowledge discovery process
• Generates a-posteriori patterns

PML: direct insertion
• Some patterns are not extracted from raw data
• They are inserted directly from scratch into the system
• a-priori patterns
• Example
  – Import or insert a classifier from scratch
  – Use it to classify existing data

PML: synchronization
• Source data change with high frequency
• It is important to determine whether existing patterns, after a certain time, still represent the data source from which they have been generated
• If they do not, it could be useful to change the information associated with a pattern when the quality of the representation, or its validity over time, changes
• Alternative: generation of new patterns

PML: synchronization
[Figure: pattern P1 (Shoes ⇒ socks, Support = 0.55, Confidence = 0.75) is extracted from transactions T1. When the transactions change from T1 to T2, the measures become Support = 0.7 and Confidence = 0.80, either on a new pattern P2 kept next to P1 or on P1 itself]

PML: mining function
• A-posteriori patterns are generated from raw data by applying some kind of mining function
  – Association rules: Apriori algorithm
  – Clusters: k-means algorithm
• A library of mining functions, together with the possibility to define new functions when required, makes pattern generation much more flexible

PQL: queries over patterns
• Primitives for pattern retrieval
  – Selection
  – Similarity-based selection
    • How is it possible to define pattern similarities?
  – Join: how can patterns be combined together?
    • If shoes ⇒ socks and socks ⇒ t-shirts, what can we say about shoes ⇒ t-shirts?

PQL: similarity
• Useful whenever we have to measure differences between models describing evolving data or data extracted from different sources
  – monitoring monthly sales of a supermarket
  – analyzing differences of data characteristics across several sets of data (customer transactions, reactions to chemical/biological substances)
• If similarity is high, there is no need to perform a thorough (and costly) analysis on the actual data

PQL: queries over patterns and source data
• Cross-over queries
• Which data is best represented by a given pattern?
• Which patterns represent a given set of data?
  – A sort of classification

Spaces interaction: a generic usage scenario
1. Extraction
2. Select one or more patterns (PS1)
3. What are the data represented by such patterns? (DS1)
4. Select a certain subset of DS1 (DS2)
5. What are the patterns representing, possibly approximately, DS2? (PS2)
6. Is a pattern in PS2 suitable for representing a dataset DS3?
7. How different are two pattern sets? Is the difference significant?
8. How similar are two patterns?
9. How can patterns be combined together?
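
The two cross-over questions (from patterns to data, and from data back to patterns) can be illustrated on itemset patterns. This is a hedged sketch under stated assumptions, not a PQL implementation: an itemset is taken to "represent" a transaction when it is a subset of it, a pattern is taken to represent a dataset approximately when it is contained in at least a fraction `min_fraction` of its transactions, and the names `PATTERNS`, `data_represented_by`, `patterns_representing` and `min_fraction` are all hypothetical.

```python
# Transactions of the running example, each as a set of items.
TRANSACTIONS = {
    "T1": {"Beer", "Potato Chips", "Refreshments", "Nappies"},
    "T2": {"Whisky", "Beer", "Nappies"},
    "T3": {"Detergents", "Broom", "Beer", "Potato Chips"},
    "T4": {"Milk", "Potato Chips", "Tomatoes", "Carrots"},
    "T5": {"Cigarettes", "Meat", "Refreshments"},
    "T6": {"Meat", "Cheese", "Fish", "Refreshments", "Beer", "Potato Chips"},
}

# Illustrative pattern base: two itemsets, keyed by a pattern identifier.
PATTERNS = {
    "A1": frozenset({"Beer", "Potato Chips"}),
    "A2": frozenset({"Beer", "Refreshments"}),
}


def data_represented_by(pattern_ids):
    """Patterns -> data: ids of the transactions that contain at least one selected itemset."""
    return {
        tid
        for tid, items in TRANSACTIONS.items()
        if any(PATTERNS[pid] <= items for pid in pattern_ids)
    }


def patterns_representing(tids, min_fraction=1.0):
    """Data -> patterns: ids of the itemsets contained in at least `min_fraction` of `tids`.

    min_fraction = 1.0 asks for exact representation; lower values allow approximation.
    """
    tids = list(tids)
    if not tids:
        return set()
    result = set()
    for pid, itemset in PATTERNS.items():
        covered = sum(1 for tid in tids if itemset <= TRANSACTIONS[tid])
        if covered / len(tids) >= min_fraction:
            result.add(pid)
    return result


if __name__ == "__main__":
    ds1 = data_represented_by({"A1"})           # data behind a selected pattern set
    ds2 = {tid for tid in ds1 if tid != "T3"}   # a subset chosen by a data query
    print(sorted(ds1), sorted(patterns_representing(ds2)))
```

The threshold-based notion of approximate representation is only one possible choice; the frameworks discussed in this tutorial define the relationship in their own terms.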
Spaces interaction: language component involved at each of the nine scenario steps, in order: PML; PQL selection; PQL cross-over; QL selection; PQL cross-over; PQL cross-over; PQL measures; PQL similarity; PQL combination

Pattern management taxonomy
[Figure: taxonomy tree with the nodes Pattern management, Theoretical proposals, Standards for patterns, Metadata management, Frameworks, Integrated architecture, Separated architecture, Languages, Pattern similarity]
Proposals attached to the nodes across the slide sequence:
• Integrated architecture
  – Inductive databases (Imielinski & Mannila, 1996)
  – CINQ project (1998-2002) (De Raedt, 2002) (Meo et Al, 2004)
• Languages for inductive databases
  – No storage: DMQL (Han et Al., 1996), ODMQL (Elfeky et Al., 2001)
  – Storage, no query: Mine Rule (Meo et Al., 1996-1999), XMine (Braga et Al., 2002)
  – Storage, query, recomputation: MSQL (Imielinski & Virmani, 1996-1999)
• Separated architecture
  – 3World Model (Johnson, Lakshmanan, Ng, 2000)
  – PANDA framework (Rizzi et Al, 2001-2004)
• Pattern similarity
  – FOCUS (Ganti et Al., 1999)
  – PANDA approach (Bartolini et Al., 2004)
• Standards for patterns (data mining standards)
  – PMML
  – CWM
  – ISO SQL/MM
  – JDM API
• Metadata management
  – RDF
  – Dublin Core
  – …

Theoretical proposals: what is the aim?
• Definition of pattern management frameworks providing full support for heterogeneous pattern generation and management
  – back-end technologies for pattern management applications
• Similarities for patterns

Inductive databases
• First defined in 1996 (Imielinski & Mannila, 1996)
• Mainly investigated in the context of the EU project CINQ (Consortium on Discovering Knowledge with Inductive Queries, 1998-2002)
• Aim
  – Develop a general theory of inductive databases (IDBs)
  – Analyze query evaluation for well-known pattern domains (e.g., association rules) and some new ones (e.g., graphs)
  – Provide extensions of existing query languages
  – Implement prototypes
  – Evaluate prototypes against several applications (Web mining, bioinformatics)

IDBs: features
• Integrated architecture
• Model
  – No user-defined pattern types (support for common pattern types)
  – No general hierarchies
  – Patterns represented according to the raw data model
  – Measures
  – No explicit relationship with raw data
• PML
  – Extraction is a query operation
• PQL
  – Constraint theories as formal foundation
  – Extension of standard data query languages

IDBs
• Knowledge discovery as an extended query process
  – There is no such thing as real discovery, just a matter of the expressive power of the query languages (Imielinski and Mannila, CACM, Nov. 1996)
• General frameworks + inductive extensions for different querying paradigms
• Specific types of patterns
  – Itemsets
  – Association rules
  – Sequences
  – Clusters
  – Equations

IDBs: application domains
• Molecular (MOLFEA, 2004)
  – a domain-specific IDB
• Association rules and itemsets (Minerule System, 2004)
  – main paradigm for IDBs

IDBs: the framework
• An IDB is composed of:
  – A set of data sets
  – A set of pattern sets
• IDB languages
  – A query language that generates data sets
  – An inductive query language that generates pattern sets
• Data and pattern sets can be extensional or intensional

IDBs: the framework
• create data set D as query
• create view data set D as query
• create pattern set P as query
• create pattern view P as query
• Insert/delete/update statements

IDBs: theoretical foundations
• Formal theory for IDBs (De Raedt et Al, 2002)
• For each pattern type
  – Language of patterns (e.g., itemsets, association rules, sequences, graphs, dependencies, decision trees, clusters)
  – Evaluation functions (e.g., frequency, closures, generality, validity, accuracy)
  – Primitive constraints (e.g., minimal/maximal frequency, minimal accuracy)
• Constraint programming can be used for extraction and further queries (post-processing)
  – Constraint-based mining (SIGKDD, 2002)

IDBs: constraint examples
• Cmaxfreq(φ,r) ≡ freq(φ,r) ≤ γ, γ ∈ [0,1]
• Cminfreq(φ,r) ≡ freq(φ,r) ≥ γ, γ ∈ [0,1]
• Cclose(φ,r) ≡ closure(φ,r) = φ
• …

IDBs: examples of constraint-based computations
• Standard association rule mining
  – Cminfreq(φ,r) ∧ Cminconf(φ,r)
• Discriminant patterns
  – Frequent in one dataset and infrequent in another
  – Cminfreq(φ,r1) ∧ Cmaxfreq(φ,r2)
• Post-processing
  – As before, without extractions
• Computing condensed (concise) representations
  – Cminfreq(φ,r) ∧ Cclose(φ,r)
  – …

IDBs: the languages
[Figure: languages plotted by capabilities against year (1996 to 2002). Only extraction: DMQL (Han et Al., 1996), ODMQL (Elfeky et Al., 2001). + storage: Mine Rule (Meo et Al., 1996-1999), XMINE (Braga et Al., 2002). + retrieval, recomputation: MSQL (Imielinski & Virmani, 1996-1999)]

DMQL (Han et Al., 1996) and ODMQL (Elfeky et Al., 2001)
• DMQL: SQL-like language
• ODMQL: OQL-like language
• Similar characteristics
• Extracted patterns
  – association rules
  – characteristic rules
  – discriminant rules
  – classification rules
• No rule storage

DMQL and ODMQL
• For each type of pattern, a set of measures is provided
• Conditions over them can be specified
• Ability to generalize or specialize the mined results
  – {hiking_boots} ⇒ {ski_pants}
  – {hiking_boots} ⇒ {pants}
  – {hiking_boots} ⇒ {clothes}
  – …

MineRule (Meo et Al., 1996-1999)
• SQL extension with the MineRule operator for rule extraction
• Transactions are assumed to be stored in relations
• Usage of hierarchies over raw data to generalize association rules
Example statement:
    MINE RULE MarketAssRules AS
    SELECT DISTINCT 1..n item AS BODY, 1..n item AS HEAD, SUPPORT, CONFIDENCE
    WHERE BODY.price >= 100 AND HEAD.price < 100
    FROM Purchase
    GROUP BY Customer
    CLUSTER BY date
    EXTRACTING RULES WITH SUPPORT: 0.01, CONFIDENCE: 0.2
Slide annotations:
  – WHERE: conditions to be satisfied by items in the body and the head
  – GROUP BY Customer: purchases are grouped to form transactions
  – CLUSTER BY date: rules are generated from purchases performed on the same date

XMINE (Braga et Al., 2002)
• Association rules for XML documents
• Merge of XQuery and MineRule

MSQL (Imielinski & Virmani, 1996-1999)
• SQL-like language for rule extraction and management
• Transactions are stored in relations
• GET_RULES statement: association rules are generated and stored in extended relations
Example statement:
    GetRules(Employees)
    into My_Emp_Rules
    where Body has { (Job=*) }
    AND Consequent has { (Age=*) }
    AND confidence > 0.9 and support > 0.3
Slide annotations:
  – Employees: source dataset
  – where clause: conditions over rules

MSQL
• SELECT_RULES operator: provides queries over association rules
    SelectRules(My_Emp_Rules)
    where Body has { (Sex=*), (Salary=[30000,80000]) }
• Measure recomputation over views
    Project Body, Consequent, Confidence(NJ_Emp), Support(NJ_Emp), Confidence(NY_Emp), Support(NY_Emp),
    SelectRules(My_Emp_Rules)
    where Body has { (Sex=*) } AND Consequent has { (Car=*) }

MSQL
• SATISFY/VIOLATE: two operators for cross-over queries
• Determine whether a tuple satisfies or violates at least one or all the association rules in a given set.
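
The meaning of these two operators can be sketched in a few lines. The sketch below is an illustration under assumptions, not MSQL itself: a tuple is taken to satisfy a rule when it matches both body and consequent, and to violate it when it matches the body but not the consequent; rules are simplified to attribute = value conditions, and the function names, the sample rules and the sample tuples are all hypothetical.

```python
def holds(condition, row):
    """True if `row` has every attribute of `condition` with the required value."""
    return all(row.get(attr) == value for attr, value in condition.items())


def satisfies(row, rule):
    """Assumed semantics: the row matches the rule body and the rule consequent."""
    return holds(rule["body"], row) and holds(rule["consequent"], row)


def violates(row, rule):
    """Assumed semantics: the row matches the rule body but not the consequent."""
    return holds(rule["body"], row) and not holds(rule["consequent"], row)


def select_rows(rows, rules, mode="SATISFIES", quantifier="ANY"):
    """Keep the rows that satisfy (or violate) any (or all) of the given rules."""
    test = satisfies if mode == "SATISFIES" else violates
    combine = any if quantifier == "ANY" else all
    return [row for row in rows if combine(test(row, rule) for rule in rules)]


# Hypothetical rule set and tuples, for illustration only.
RULES = [
    {"body": {"Job": "Clerk"}, "consequent": {"Age": 30}},
    {"body": {"Job": "Manager"}, "consequent": {"Age": 30}},
]
EMPLOYEES = [
    {"Name": "a", "Job": "Clerk", "Age": 30},
    {"Name": "b", "Job": "Clerk", "Age": 45},
    {"Name": "c", "Job": "Manager", "Age": 30},
]

if __name__ == "__main__":
    names = [row["Name"] for row in select_rows(EMPLOYEES, RULES, "VIOLATES", "ANY")]
    print(names)
```

With the ALL quantifier, a tuple must pass the test for every rule in the set, which is a much stronger requirement than ANY under these assumed semantics.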
General form:
    Select … From … Where { SATISFIES | VIOLATES } { ALL | ANY } (<GetRules | SelectRules Subquery>)

Impact on DBMS technology
• Constraint programming as theoretical framework
• Extensions of existing query languages
  – Impact on the relational and object-oriented model

The 3 World framework (Johnson, Lakshmanan, Ng, 2000)
• Knowledge discovery as a multistep process
• Three main issues:
  – A model for heterogeneous pattern representation, based on 3 different worlds
  – Languages for manipulating data in each world
  – Operators to move in and out of the worlds
• No prototype

3W framework: features
• Separated architecture
• Model
  – User-defined pattern types
  – Hierarchies
  – Measures
  – Relationship with raw data
• PML
  – A-posteriori patterns
  – Synchronization
  – No mining function specification
• PQL
  – Selection, projection, …
  – Pattern combination
  – No similarity
  – Cross-over queries

The 3W model
• I-World: intensional description of patterns
• E-World: extensional (e.g. by enumeration) description of patterns
• D-World: raw data from which patterns have been defined

The I-world
• Patterns can be represented as possibly overlapping regions of the data space
  – Each frequent itemset coincides with a region in the data space of itemsets
• Regions are described as conjunctions of linear constraints
  – Beer = 1 AND diaper = 1
• Sets of regions form a dimension
  – All frequent itemsets containing a given promotional item p
• Regions are associated with attributes
  – Measures
• Regions can be hierarchically organized
  – Partial order over regions
[Figure: three possibly overlapping regions p1, p2, p3]

The I-world
• Isothetic regions: axis-parallel and hyper-rectangular
• Efficient manipulation and processing through linear constraints

The E-World and the D-world
• E-World
  – Extensional representation of a region or dimension by enumerating its components with respect to the D-world
• D-world
  – Relational database

3W model – relationships with raw data
• Source data
  – D-world
• Extensional representation
  – E-world
• Intensional representation
  – I-world

An example
D-World:
T1: Beer, Potato Chips, Refreshments, Nappies
T2: Whisky, Beer, Nappies
T3: Detergents, Broom, Beer, Potato Chips
T4: Milk, Potato Chips, Tomatoes, Carrots
T5: Cigarettes, Meat, Refreshments
T6: Meat, Cheese, Fish, Refreshments, Beer, Potato Chips
I-World:
Beer = 1 AND Potato Chips = 1
E-World:
T1: Beer, Potato Chips, Refreshments, Nappies
T3: Detergents, Broom, Beer, Potato Chips
T6: Meat, Cheese, Fish, Refreshments, Beer, Potato Chips

3-W computations
[Figure: the I-world, E-world and D-world connected by operators. Labels: dimension algebra, extended relational algebra, relational algebra, mine, populate, lookup, refresh]

PQL: the dimension algebra
• Selection
  – Overlap, containment, disjointness, non-containment
• Projection
  – Project out some attributes
• Purge
  – Return satisfiable constraints
• Cartesian product
  – Pairwise combination of constraints
• Union, Difference
  – Absence of the compatibility requirement
• Renaming

Impact on DBMS technology
• Logical optimization for the dimension algebra similar to that for relational algebra
• Linear constraint programming
• First-order logic augmented with linear polynomial inequalities over reals, and relation variables with fixed arities
  – PTIME
  – decidability for equivalence checking
• Spatial database technology for pattern management

The PANDA framework (Rizzi et Al, ER’01)
• EU Project PAtterns for Next-generation DAtabase systems (2001-2004)
• Aims
  – lay the foundations for pattern modeling
  – investigate the main issues involved in managing and querying a pattern-base
  – outline the requirements for building a PBMS
• Preliminary prototype

PANDA: features
• Separated architecture
• Model
  – User-defined pattern types
  – Hierarchies
  – Measures
  – Relationship with raw data
• PML
  – A-posteriori and a-priori patterns
  – Synchronization
  – Mining function specification
• PQL
  – Selection, projection, …
  – Combination
  – Similarity
  – Cross-over queries

The PANDA architecture
[Figure: inside the PBMS, a class layer (PatternsOfExperiment145, MyClustersOnTableEMP), a type layer (association rule, decision tree, cluster and cyclic cluster types) and a pattern layer (association rules 1 to 3, DBSCAN clusters 1 and 2), linked by member-of and instance-of relationships. Below the PBMS: an intermediate mapping layer, data mining and pattern recognition algorithms, and a raw data layer (DB1, DB2, flat file)]

The PANDA model
[Figure: class layer (supermarket rules, my clusters), type layer (association rule type, decision tree type, cluster type) and pattern layer. A class is related to a pattern type; a pattern is a member of a class and an instance of a pattern type]

The PANDA model
• Pattern type: name, structure schema, source schema, measure schema, formula, validity period schema
• Pattern: PID, structure, source, measure, instantiated formula, validity period
• Class: name
• Relationships: a class is related to a pattern type; a pattern is a member of a class and an instance of a pattern type

PANDA: Example
[Figure: a cluster drawn as a disk. Structure: center and radius. Data source: the points. Measure: average intra-cluster distance. Formula: the disk equation]

PANDA: Example
Pattern type:
    n: Cluster
    s: disk: TUPLE(center: TUPLE(CX1: real, CX2: real), rad: real)
    d: SET(X1: real, X2: real)
    m: AvgIntraClusterDistance: real
    f: (X1 - disk.center.CX1)² + (X2 - disk.center.CX2)² ≤ disk.rad²
    vp: [DAY, DAY)
Pattern:
    pid: 337
    s: disk: TUPLE(center: TUPLE(CX1: 2, CX2: 3), rad: 4)
    d: users(X,Y)
    m: AvgIntraClusterDistance: 0.9
    f: (X - 2)² + (Y - 3)² ≤ 4²
    vp: [01/01/2004, 03/31/2004)

PANDA: Example
Pattern type:
    n: AssociationRule
    s: TUPLE(head: SET(STRING), body: SET(STRING))
    d: BAG(transaction: SET(STRING))
    m: TUPLE(confidence: REAL, support: REAL)
    f: ∀ x (x ∈ head ∨ x ∈ body → x ∈ transaction)
    vp: [DAY, DAY)
Pattern:
    pid: 512
    s: (head = {'Boots'}, body = {'Socks', 'Hat'})
    d: SELECT SETOF(article) AS transaction FROM sales GROUP BY transactionId
    m: (confidence = 0.75, support = 0.55)
    f: ∀ x (x ∈ {'Boots', 'Socks', 'Hat'} → x ∈ transaction)
    vp: [01/01/2004, 03/31/2004)

PANDA: Example
• Pattern type: AssociationRule
• Class: SaleRules
• Dataset 1: SELECT SETOF(article) AS transaction FROM sales_shop1 GROUP BY transactionId
  – Mining function: Apriori
  – Patterns: association rules 512, 513, 514
• Dataset 2: SELECT SETOF(article) AS transaction FROM sales_shop2 GROUP BY transactionId
  – Mining function: Apriori
  – Patterns: association rules 515, 516, 517

Pattern Space and Data Space
[Figure: pattern space and data space. Pattern space labels: pattern type (name, structure schema, source schema, formula), pattern (PID, structure, source, formula, validity period), type. Data space labels: dataset, backward image]

PANDA: Relationships with raw data
• Source data
  – Raw data
  – Intensional description inside patterns
• Extensional representation
  – Explicit intermediate mapping
• Intensional representation
  – The formula
  – Approximated intermediate mapping

PANDA: Pattern Hierarchies
• Specialization
  – Inheritance between pattern types; a class related to pattern type 1 can also contain patterns of the specialized pattern type 2
• Composition
  – part-of relationship between pattern types
  – Ability to refer to pattern types in the structure schema
• Refinement
  – refined-by relationship between pattern types
  – Ability to refer to pattern types in the source schema

PANDA: Pattern Hierarchies Example
    n: ClusterOfRules
    ss: representative: AssociationRule (composition)
    ds: SET(rule: AssociationRule) (refinement)
    ms: TUPLE(deviationOnConfidence: REAL, deviationOnSupport: REAL)
    f: rule.ss.head = representative.ss.head

PANDA: Pattern Validity
• Temporal validity
  – Pattern validity with respect to user requirements
  – A certain pattern is assumed to be usable in a given interval I
• Semantic validity
  – Pattern (measure) validity with respect to the data source
[Figure: temporal validity along the time axis, semantic validity along the data axis, and safety where both hold]

PANDA PML
[Figure: PML operations between raw data and patterns: extraction, direct insertion, deletion, synchronization, recomputation]

PANDA PML
• Both a-priori (direct insertion) and a-posteriori (extraction) patterns, with mining function specification
• Synchronization
  – verifies whether an existing pattern still holds with respect to its source data and, possibly, changes the pattern measures
• Recomputation
  – like synchronization, but new patterns are generated from the update
[Figure: effect of a raw data update on pattern p1 (components s, d, m, f, vp): synchronization replaces its measure m with m'; recomputation produces a new pattern P2 of pattern type PT1]
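
The difference between the two operations can be illustrated on a single association rule. This is a simplified sketch and not the PANDA PML: support is assumed to be the fraction of transactions containing body and head, recomputation is reduced to re-evaluating the same rule and storing the result as a new pattern instead of re-running a mining function, and `RulePattern`, `measures`, `synchronize` and `recompute` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RulePattern:
    """A much reduced association rule pattern: identifier, structure, measures."""
    pid: int
    body: frozenset
    head: frozenset
    support: float
    confidence: float


def measures(body, head, transactions):
    """Relative support and confidence of body => head over a list of item sets."""
    with_body = [t for t in transactions if body <= t]
    with_both = [t for t in with_body if head <= t]
    support = len(with_both) / len(transactions) if transactions else 0.0
    confidence = len(with_both) / len(with_body) if with_body else 0.0
    return support, confidence


def synchronize(pattern, transactions):
    """Keep the same pattern (same pid) and refresh its measures on the updated data."""
    support, confidence = measures(pattern.body, pattern.head, transactions)
    return replace(pattern, support=support, confidence=confidence)


def recompute(pattern, transactions, new_pid):
    """Leave the old pattern untouched and return a new pattern for the updated data."""
    support, confidence = measures(pattern.body, pattern.head, transactions)
    return RulePattern(new_pid, pattern.body, pattern.head, support, confidence)


if __name__ == "__main__":
    p1 = RulePattern(1, frozenset({"Shoes"}), frozenset({"Socks"}), 0.55, 0.75)
    updated = [{"Shoes", "Socks"}, {"Shoes", "Socks", "Hat"}, {"Shoes"}, {"Hat"}]
    print(synchronize(p1, updated))
    print(recompute(p1, updated, new_pid=2))
```

After synchronization the pattern base still holds one pattern with refreshed measures; after recomputation it holds both the old pattern and the new one.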
PANDA PQL
• Selection
  – Predicates over all components, including the formula
• Measure projection
  – Project out some attributes
• Reconstruction
  – Structure manipulation
• Join
• Union, Difference
• Renaming

PANDA PQL: join
• Intersection join
  – (p1 ⋈ p2).s = (p1.s, p2.s)
  – (p1 ⋈ p2).d = p1.d ∪ p2.d
  – (p1 ⋈ p2).f = p1.f ∧ p2.f
• Union join
  – (p1 ⋈ p2).s = (p1.s, p2.s)
  – (p1 ⋈ p2).d = p1.d ∪ p2.d
  – (p1 ⋈ p2).f = p1.f ∨ p2.f
[Figure: regions labeled p1.f ∧ p2.f and p1.f ∨ p2.f]

PANDA PQL: processing
• Structure-based processing
  – Usage of pattern structure
  – Object-relational processing
  – Example: determine all association rules whose Body contains attribute Job
• Approximated processing
  – Usage of the formula component
  – Logical, constraint-based processing
  – Example: determine all patterns that represent in an approximate way employees in New York
• Mapping-based processing
  – Usage of relationships with raw data
  – Data processing
  – Example: determine all patterns that represent only employees in New York

Impact on DBMS technology
• Object-relational technology is useful for structure manipulation
• Constraint programming for formulas
• Complexity of the language depends on the chosen formula language
  – Linear constraints: PTIME
• Spatial DBMS technology is useful

Theoretical proposals: architecture and model
[Table: columns 3W model, PANDA, Inductive DB. Architecture: separated, separated, integrated. Other rows: hierarchies, measures, data source, user-defined types, validity, prototype. Surviving cell entry: region-based]

Theoretical proposals: manipulation
[Table: columns 3W model, PANDA, Inductive DB. Rows: PML, a-posteriori (extraction), a-priori (direct insertion), deletion & update, synchronization, mining function. Surviving cell entries: extraction as a query operation; recomputation (three times); synchronization]

Theoretical proposals: querying
[Table: columns 3W model, PANDA, Inductive DB. Rows: PQL, combination, similarity, cross-over queries. Surviving cell entries: 3W model: algebra, Cartesian product; PANDA: algebra/calculus, join; Inductive DB: constraint-based, SQL/OQL-like, integrated architecture]

Similarities for patterns
• Computed with respect to either
  – The data source represented by patterns
    • overhead
  – Just patterns: pattern structure + pattern measures
• sim(p1,p2) ∈ {0,1}
  – p1 and p2 have the same type
• Two main general approaches
  – FOCUS (Ganti et al., 1999)
  – PANDA approach (Bartolini et al., 2004)

Similarities for patterns: FOCUS
[Figure: p1 and p2 are refined into p1r and p2r; an aggregate function over the measures yields sim(p1,p2)]
• Refinement
  – Detection of the greatest common refinement (GCR) of p1.s and p2.s
  – Recomputation of p1.m and p2.m over the GCR
• Decision trees, clusters, and frequent itemsets can be refined

Similarities for patterns: FOCUS
sim(T1,T2) = |0.0 - 0.0| + |0.0 - 0.04| + |0.1 - 0.14| + |0.0 - 0.0| + |0.0 - 0.0| + |0.005 - 0.1| = 0.175

Similarities for patterns: PANDA
• No need for refinement
• Applicable also to complex patterns, defined over other patterns
  – Clusters of association rules
• Does not necessarily require data access

Outline
• Introduction to pattern management
• Features
  – Architectures
  – Models
  – Languages
• A classification of existing proposals
• Theoretical proposals
• Standards
• Open issues

Standards for patterns: what is the aim?
• Provide a standard representation for patterns resulting from data mining and data warehousing processes
  – No generic patterns are supported
• Support their exchange between different architectures
• Serve as a front-end for pattern management applications

Standards for patterns: issues
1. Modeling the overall process by which data mining models are produced, used, and deployed
2. A standard representation for data mining and statistical models
3. A standard representation for cleaning, transforming, and aggregating attributes to provide the inputs for data mining models
4. A standard representation for specifying the settings required to build models and to use the outputs of models in other systems
5. Interfaces and Application Programming Interfaces (APIs) to other languages and systems (Java & SQL)
6. Standards for viewing, analyzing, and mining remote and distributed data

Standards & theoretical proposals: an overall picture
[Figure labels: Web standards; Process standards; Standards for pattern representation; Standard pattern representation; Standard APIs; Pattern Base; Pattern engine application; Theoretical proposals]

Standards for patterns: a classification
• Process standards
  – Cross Industry Standard Process for Data Mining (CRISP-DM)
• Standards for pattern representation
  – Predictive Model Markup Language (PMML)
  – Common Warehouse Model for Data Mining (CWM-DM)
• Standard APIs
  – SQL/MM, JDM
• Web standards
  – XML for Analysis (XMLA)

Predictive Model Markup Language (PMML)
• Standardization effort of DMG (Data Mining Group)
• XML-based language for representing data mining models
• Aim
  – Support the exchange of data mining models between different applications and visualization tools

PMML usage
[Figure labels: Pattern management layer; Data & Pattern Base (DBMS); Data Base (DBMS); Pattern Base (PBMS)]
<!-- model in PMML format -->
<PMML version="1.1">
  <TreeModel ModelName="golf" etc.>
    <Node score="play"> etc. </Node>
    etc.
  </TreeModel>
</PMML>

PMML
• Relatively narrow so that it could serve as common ground for possible subsequent standards
  – Source data
  – Mining function
  – Parameters for the mining function
• Specification of the pattern type and pattern instances

PMML: supported patterns
• Association Rules
• Decision Trees
• Center Based Clustering
• Distribution Based Clustering
• (General) Regression
• Neural Networks
• Naive Bayes
• Sequences
• …

PMML: pattern type specification
• Data dictionary
  – Describes the attributes of the source data
• Mining schema
  – One for each pattern
  – For each attribute of the data dictionary, specifies whether it is used by the pattern as an input or an output
• Transformation dictionary
  – Defines derived fields
• Model statistics
• Model parameters
  – Parameters required by each pattern type
• Mining model and functions

PMML: an example
<?xml version="1.0" ?>
<PMML version="3.0">
  <Header copyright="www.dmg.org" description="example model for association rules"/>
  <DataDictionary numberOfFields="2">
    <DataField name="transaction" optype="categorical" />
    <DataField name="item" optype="categorical" />
  </DataDictionary>
  <AssociationModel modelName="My_Ass_rule" functionName="associationRules" algorithmName="Apriori" numberOfTransactions="4" numberOfItems="3" minimumSupport="0.6" minimumConfidence="0.5" numberOfItemsets="3" numberOfRules="2">
    <MiningSchema>
      <MiningField name="transaction" usageType="group" />
      <MiningField name="item" usageType="predicted"/>
    </MiningSchema>
    <!-- We have three items in our input data -->
    <Item id="1" value="Cracker" />
    <Item id="2" value="Coke" />
    <Item id="3" value="Water" />
    <!-- and two frequent itemsets with a single item -->
    <Itemset id="1" support="1.0" numberOfItems="1">
      <ItemRef itemRef="1" />
    </Itemset>
    <Itemset id="2" support="1.0" numberOfItems="1">
      <ItemRef itemRef="3" />
    </Itemset>
    <!-- and one frequent itemset with two items. -->
    <Itemset id="3" support="1.0" numberOfItems="2">
      <ItemRef itemRef="1" />
      <ItemRef itemRef="3" />
    </Itemset>
    <!-- Two rules satisfy the requirements -->
    <AssociationRule support="1.0" confidence="1.0" antecedent="1" consequent="2" />
    <AssociationRule support="1.0" confidence="1.0" antecedent="2" consequent="1" />
  </AssociationModel>
</PMML>

Common Warehouse Model (CWM)
• Standardization effort of the Object Management Group (OMG)
• A common metamodel of the data warehousing and business intelligence domains
• Consists of a platform-independent metamodel definition
• Includes an XML-based interchange format for metadata
• Also includes a mapping to a platform-independent API specification (CORBA IDL)
• Tools that standardize on CWM can readily share metadata via CWM-compliant XML files

CWM architecture
[Figure labels: MOF (Meta-Object Facility); CWM; UML; XMI (XML Metadata Interchange); XML document]

CWM Data Mining
• The CWM metamodel consists of a number of sub-metamodels
  – Data Resources
  – Data Analysis (OLAP, Data Mining, …)
  – Warehouse Management
  – …

CWM Data Mining
• Three conceptual areas (UML instances)
  – Model description
    • MiningModel: a representation of the mining model itself
    • MiningSettings: driving the construction of the model
    • ApplicationInputSpecification: set of input attributes for the model
    • MiningModelResult: result set produced by the testing or application of a generated model
  – Settings (for the mining functions)
    • StatisticsSettings
    • ClusteringSettings
    • AssociationRulesSettings
    • SupervisedMiningSettings
      – ClassificationSettings
      – RegressionSettings
  – Attributes

CWM-DM patterns
• Clustering
• Association Rules
• Supervised
  – Classification
  – Regression
• Statistics
  – Attribute Importance

SQL/MM
• SQL Multimedia and Application Packages (ISO SQL/MM)
  – Specification for the management of data types relevant in multimedia and other knowledge-intensive applications in SQL-99
• It defines several class libraries of SQL object types
• The structured types defined in such libraries are first-class SQL types accessed through ordinary SQL:1999
• SQL/MM Parts
  – Part 1: Framework
  – Part 2: Full-Text
  – Part 3: Spatial
  – Part 5: Still Image
  – Part 6: Data Mining

SQL/MM Part 6
• Standardized interface to data mining algorithms
• Can be layered on top of any ORDBMS or even deployed as middleware when required
• Provides several SQL user-defined types to support pattern extraction, storage, and retrieval of common pattern types

SQL/MM architecture
[Figure labels: PML through SQL/MM types; PQL through SQL and SQL/MM types; Pattern Base; SQL/MM types; ORDBMS]

SQL/MM: supported patterns
• Association rules
• Clusters
• Regression
• Classification

SQL/MM: supported phases
[Figure labels: training phase (data, settings, model; pattern type & mining function; source data); application phase (raw data, result; patterns); test phase (test result)]

SQL/MM: types for mining
• DM_*Model
  – Defines the model that you want to use when mining your data
• DM_*Settings
  – Stores various parameters of the data mining model, e.g. depth of a decision tree, maximum number of clusters
• DM_*Result
  – Sets of patterns created by running the data mining model against real data
• DM_*TestResult
  – Holds the results of testing during the training phase of the data mining models
• DM_*Task
  – Stores the metadata that describe the process and control of the testing and of the actual runs
* stands for one of: Clas, Rule, Clustering, Regression

SQL/MM: additional types
• DM_MiningData
  – Abstraction for real data contained in tables or views
  – It just stores metadata to access the real data sources and any other information necessary to make the real data accessible for a later data mining training or test run (e.g., transformations)
• DM_MiningMapping
  – Allows the specification of data mining field related information (e.g., categorical)
• DM_ApplicationData
  – Abstraction for a set of values with associated names representing a single row of input data

SQL/MM: type interactions
[Two figures; no labels survive]

JDM
• Java Specification Request 73 (JSR-73), also known as Java Data Mining (JDM)
• Pure Java API to support creation, storage, access, and maintenance of data and metadata supporting data mining models, data scoring, and data mining results
• Input/output in various formats (PMML, CWM-DM)

JDM architecture
[Figure labels: JDM API; Data mining engine (DME); Pattern base (Mining Object Repository); PBMS]

JDM: supported patterns
• Classification
• Regression
• Attribute importance
• Clustering
• Association rules
… and several algorithms …

JDM: supported operations
• Building a Model
  – Users define input tasks specifying the parameters model name, mining data, and mining settings
  – Specification of pattern type and mining function details
• Testing a Model
  – Gives an estimate of the accuracy a model has in predicting the target
• Applying a Model
  – The model is applied to a case and produces one or more predictions or assignments
  – Pattern extraction
• Object Import and Export
  – Interchange with other DMEs, persistent storage outside the DME
  – Object inspection or manipulation
  – To enable import and export of system metadata, JDM specifies two standards for defining metadata in XML: PMML and CWM
  – A-priori patterns
• Computing statistics on data
  – Computes various statistics on a given physical data set
  – Measure computation
• Verifying task correctness

JDM: an example
// dmeConn is assumed to be an open connection to the data mining engine (DME);
// uri is assumed to hold the location of the build data.
// Create the physical representation of the data
PhysicalDataSetFactory pdsf = (PhysicalDataSetFactory)
    dmeConn.getFactory( "javax.datamining.data.PhysicalDataSet" );
PhysicalDataSet bd = pdsf.create( uri, true );
dmeConn.saveObject( "myBuildData", bd, false );
// Create the settings to build an association rule model
AssociationSettingsFactory asf = (AssociationSettingsFactory)
    dmeConn.getFactory( "javax.datamining.association.AssociationSettings" );
AssociationSettings associationSettings = asf.create();
associationSettings.setMaxNumberOfRules( 100 );
dmeConn.saveObject( "myAssBS", associationSettings, false );

JDM: an example
// Create a task to build an association model with data and settings
BuildTaskFactory btf = (BuildTaskFactory)
    dmeConn.getFactory( "javax.datamining.task.BuildTask" );
BuildTask task = btf.create( "myBuildData", "myAssBS", "myAssociationModel" );
dmeConn.saveObject( "myAssTask", task, false );
// Execute the task and check the status
ExecutionHandle handle = dmeConn.execute( "myAssTask" );
handle.waitForCompletion( Integer.MAX_VALUE ); // wait until done
// check the returned status …

JDM: an example
// Restore an association model to extract association rules
AssociationModel assocModel =
    (AssociationModel) dmeConn.retrieveObject( "myAssociationModel" );
// Specify rule selection criteria (support >= 30% AND confidence >= 90%)
RulesFilterFactory filterFactory = (RulesFilterFactory)
    dmeConn.getFactory( "javax.datamining.association.RulesFilterFactory" );
RulesFilter rulesFilter = filterFactory.create();
// The range of the support values is from 0.3 (30%) to 1.0 (100%).
rulesFilter.setRange( RuleProperty.support, 0.3, 1.0 );
// The range of the confidence values is 0.9 (90%) to 1.0 (100%).
rulesFilter.setRange( RuleProperty.confidence, 0.9, 1.0 );

JDM: an example
// Extract rules from the model using the filtering criteria
Collection rulesCollection = assocModel.getRules( rulesFilter );
Iterator ruleIt = rulesCollection.iterator();
while( ruleIt.hasNext() ) {
    AssociationRule r = (AssociationRule) ruleIt.next();
    /* work with the rule retrieved here... */
}

Standards: models
[Table: columns PMML, CWM-DM, SQL/MM, JDM. Rows: user-defined pattern types, hierarchies, measures, data source, validity. No cell entries survive]

Standards: manipulation and querying
[Table: columns SQL/MM, JDM. Languages: SQL, Java. Other rows: a-posteriori, a-priori, deletion, synchronization, mining function, combination, similarity, cross-over queries. No other cell entries survive]

Standards: the commercial DBMS choices
• Most commercial DBMSs have been extended with data mining functionalities
  – Oracle Data Mining
  – Microsoft SQL Server 2005 Data Miner
  – IBM Intelligent Miner
• They usually provide
  – SQL extensions for pattern representation and manipulation
  – Oracle Java API
  – Import/Export in PMML for patterns

Outline
• Introduction to pattern management
• Features
  – Architectures
  – Models
  – Languages
• A classification of existing proposals
• Theoretical proposals
• Standards
• Open issues

Where are we now?
• Theoretical frameworks are more expressive than existing standard proposals
• Standards are lacking in modeling
  – No user-defined patterns
  – No hierarchies
• Standards are lacking in manipulation
  – No manipulation of heterogeneous patterns
  – Similarity functions
  – Pattern combination operators
  – Pattern synchronization with source data

Where are we now?
• Are those characteristics really needed?
  – Combined efforts with industry to establish the real need for those features

What else?
• Measure ontologies
  – Pattern comparison based on measures
  – Various strategies for measure computation
    • General probabilities, Dempster-Shafer, Bayesian networks
  – Need for measure ontologies for quantitative pattern reasoning

What else?
• Physical design
  – What is a reasonable physical layer for patterns?
  – What are reasonable clustering techniques for patterns?
  – What about reasonable indexing techniques?

What else?
• Query optimization
  – Separated architecture
    • Data-based computations versus pattern-based computations
    • Heuristic: pattern-based computations are more efficient
    • How is it possible to use patterns to reduce data access in data and cross-over queries?
    • How can data and pattern query processors be combined?
  – Integrated architecture
    • Extraction optimization

What else?
• Query optimization in integrated architectures (IDBs)
  – Itemsets and association rule mining
  – Extraction optimization based on constraint usage (Ng et al., 1998)
    • Anti-monotonic, monotonic, succinct constraints
  – Incremental refinement (Baralis, Psaila, 1999)
  – Condensed representations (various proposals of the CINQ consortium)

What else?
• Access control
  – Patterns are highly sensitive information
  – An authorized access over data may correspond to an unauthorized access over patterns extracted from those data
  – Instance of the inference problem (Farkas & Jajodia, 2002)

What else?
• Access control approaches
  – Preprocessing techniques: checking through mining techniques whether it is possible to infer sensitive data
  – Run-time techniques: release patterns only when they do not represent sensitive information
  – Data modifications: perturbation and sample size restrictions are applied without disturbing data mining results

What else?
• Access control
  – What happens when a pattern is used against a dataset which is not the source dataset?
    • Cross-over computations may reduce the effect of existing techniques

Main references (1)
• Agrawal, R., Srikant, R. (1994). Fast Algorithms for Mining Association Rules in Large Databases. In Proc. of the 20th VLDB, pages 487–499.
• Bartolini, I., Ciaccia, P., Ntoutsi, I., Patella, M., Theodoridis, Y. (2004). A Unified and Flexible Framework for Comparing Simple and Complex Patterns. In LNAI 3202: Proc. of the 15th ECML/PKDD, pages 496–499.
• Baralis, E., Psaila, G. (1999). Incremental refinement of mining queries. In LNCS 1676: Proc. of DaWaK’99, pages 173–182.
• Braga, D., Campi, A., Klemettinen, M., Lanzi, P.L. (2002). Mining Association Rules from XML Data. In Proc. of DaWaK, pages 21–30.
• Catania, B., Maddalena, A., Mazza, M., Bertino, E., Rizzi, S. (2004). A Framework for Data Mining Pattern Management. In LNAI 3202: Proc. of the 15th ECML/PKDD, pages 87–98.
• De Raedt, L. (2002). A Perspective on Inductive Databases. ACM SIGKDD Explorations Newsletter, 4(2), pages 69–77.
• De Raedt, L., Jaeger, M., Lee, S.D., Mannila, H. (2002). A Theory on Inductive Query Answering. In Proc. of ICDM, pages 123–130.
• Elfeky, M.G., Saad, A., Fouad, S.A. (2001). ODMQL: Object Data Mining Query Language. Lecture Notes in Computer Science (1944), pages 128–140.

Main references (2)
• Farkas, C., Jajodia, S. (2002). The Inference Problem: a Survey. SIGKDD Explor. Newsl., 4(2): 6–11.
• Ganti, V., Gehrke, J., Ramakrishnan, R., Loh, W.-Y. (1999). A Framework for Measuring Changes in Data Characteristics. In Proc. of PODS’99, pages 126–137.
• Han, J., Fu, Y., Wang, W., Koperski, K., Zaiane, O. (1996). DMQL: A data mining query language for relational databases. In Proc. of ACM SIGMOD'96 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'96).
• Han, J., Kamber, M. (2001). Data Mining: Concepts and Techniques. Academic Press.
• Imielinski, T., Mannila, H. (1996). A Database Perspective on Knowledge Discovery. Communications of the ACM, 39(11): 58–64.
• Imielinski, T., Virmani, A. (1999). MSQL: A Query Language for Database Mining. Data Mining and Knowledge Discovery, 2(4): 373–408.
• Johnson, S., Lakshmanan, L.V.S., Ng, R.T. (2000). The 3W Model and Algebra for Unified Data Mining. In Proc. of VLDB, pages 21–32.
• Lyman, P., Varian, H.R. (2003). How much information. Available at http://www.sims.berkeley.edu/how-much-info-2003

Main references (3)
• Meo, R., Psaila, G., Ceri, S. (1996). A New SQL-like Operator for Mining Association Rules. In Proc. of VLDB, pages 122–133.
• Meo, R., Psaila, G., Ceri, S. (1999). An Extension to SQL for Mining Association Rules. Data Mining and Knowledge Discovery, 2(2): 195–224.
• Meo, R., Lanzi, P.L., Klemettinen, M. (editors) (2004). Database Support for Data Mining Applications - Discovering Knowledge with Inductive Queries. LNAI 2682.
• Ng, R., Lakshmanan, L.V., Han, J., Pang, A. (1998). Exploratory Mining and Pruning Optimizations of Constrained Associations Rules. In Proc. of SIGMOD’98, pages 13–24.
• Rizzi, S. et al. (2003). Towards a Logical Model for Patterns. In Proc. of the 22nd Int. Conf. on Conceptual Modeling (ER 2003), pages 77–90.
• SIGKDD Explorations (2002). Special Issue on Constraint-Based Mining.
• Smyth, P., Goodman, R.M. (1992). An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, 4(4): 301–316.
• Theodoridis, Y., Vazirgiannis, M., Vassiliadis, P., Catania, B., Rizzi, S. (2003). A Manifesto for Pattern Bases. PANDA Technical Report TR2003-03.

References: standards
• PMML (2003). Predictive Model Markup Language. http://www.dmg.org/pmml-v3-0.html
• CWM (2001). Common Warehouse Metamodel. http://www.omg.org/cwm
• MOF (2003). Meta-Object Facility specification. http://www.omg.org/technology/documents/formal/mof.htm
• XMI (2003). XML Metadata Interchange specification. http://www.omg.org/technology/documents/formal/mof.htm
• Melton, J., Eisenberg, A. (2001). SQL Multimedia and Application Packages (SQL/MM). SIGMOD Record, 30(4): 97–102.
• JDM (2003). Java Data Mining API. http://www.jcp.org/jsr/detail/73.prt

References: projects
• CINQ (2001). The CINQ project. http://www.cinq-project.org
  – Minerule System (2004). Minerule Mining System (demo version). http://kdd.di.unito.it/minerule2/demo.html
  – MOLFEA (2004). The Molecular Feature Miner based on the LVS Algorithm (demo version). http://www.predictive-toxicology.org/cgi-bin/molfea/molfea.cgi
• PANDA (2001). The PANDA Project. http://dke.cti.gr/panda/