EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The University of Kansas

Administrative
- No class next Monday (Labor Day holiday)

8/30/2006 Frequent Patterns Mining Biological Data KU EECS 800, Luke Huan, Fall’06

Outline for today
- Mining with constraints
- Summary

Maximal & Closed Patterns
- Reduce the number of patterns
- Maximal patterns: the boundary between frequent and infrequent patterns
- Closed patterns: an information-lossless compression of frequent patterns

Borders of Frequent Itemsets
- Connected: if X and Y are frequent and X is an ancestor of Y, then all patterns between X and Y are frequent
[Figure: itemset lattice over items a, b, c, d, from the single items up to abcd]

Closed and Maximal Patterns
- Solution: mine closed patterns and max-patterns
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ ICDT’99)
- A closed pattern is a lossless compression of frequent patterns
  - Reduces the number of patterns and rules
- An itemset X is maximal if X is frequent and there exists no super-pattern Y ⊃ X such that Y is frequent

Closed Patterns and Max-Patterns
Exercise. DB = {<a1, a2>, <a1, a2, a3>}, min_sup = 1.
- What is the set of all frequent patterns?
  <a1>: 2, <a2>: 2, <a3>: 1, <a1, a2>: 2, <a1, a3>: 1, <a2, a3>: 1, <a1, a2, a3>: 1
- What is the set of max-patterns?
  <a1, a2, a3>: 1
- What is the set of closed itemsets?
  <a1, a2, a3>: 1, <a1, a2>: 2

Example of Vertical Data Format
[Figure: vertical data format example; image not recovered]

Frequent, Closed and Maximal Itemsets
Database (rows as recovered from the slide)
Transaction  Items
1            A C W T
3            A C W T
4            A C W D
5            A C W T D

Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @ SIGMOD’96)
3. Clustering: distance-based association (e.g., Yang & Miller @ SIGMOD’97)
   - One-dimensional clustering, then association
4. Deviation (e.g., Aumann and Lindell @ KDD’99)
   - Sex = female => Wage: mean = $7/hr (overall mean = $9)

Which Measures Should Be Used?
- lift and χ² are not good measures for correlations in large transactional DBs
- all-conf or coherence could be good measures (Omiecinski @ TKDE’03)
- Both all-conf and coherence have the downward closure property
- Efficient algorithms can be derived for mining (Lee et al. @ ICDM’03sub)

Knowledge Discovery (KDD) Process
- Preprocessing: identify the nature of the data
  - Learning the application domain: relevant prior knowledge and goals of the application
  - Creating a target data set: sampling
  - Data cleaning and preprocessing (may take 60% of the effort!)
  - Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Data mining: identify the structure of the data
  - Computational task: pattern discovery, classification, clustering, etc.
  - Subdivide computational tasks into computational components: choosing the mining algorithm(s)
- Result evaluation and presentation
  - Visualization
  - Hypothesis generation
  - Prediction

KDD Process: Several Key Steps
- Data mining: the core of the knowledge discovery process
[Figure: KDD pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Constraint-based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic!
  - The patterns could be too many but not focused!
- Data mining should be an interactive process
  - The user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
  - User flexibility: the user provides constraints on what is to be mined
  - System optimization: the system explores such constraints for efficient mining

Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.
- Data constraint, using SQL-like queries
  - Find product pairs sold together in stores in Chicago in Dec. ’02
- Dimension/level constraint
  - In relevance to region, price, brand, customer category
- Rule (or pattern) constraint
  - Small sales (price < $10) trigger big sales (sum > $200)
- Interestingness constraint
  - Strong rules: min_support ≥ 3%, min_confidence ≥ 60%

Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning
  - Both are aimed at reducing the search space
  - Finding all patterns satisfying the constraints vs. finding some (or one) answer in constraint-based search in AI
  - Constraint-pushing vs.
    heuristic search
  - How to integrate them is an interesting research problem
- Constrained mining vs. query processing in DBMS
  - Database query processing requires finding all answers
  - Constrained pattern mining shares a similar philosophy with pushing selections deep into query processing

How to Formalize Constraints
- Anti-monotonicity
- Monotonicity
- Succinctness

Anti-Monotonicity in Constraint Pushing
- Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
  - sum(S.Price) ≤ v is anti-monotone
  - sum(S.Price) ≥ v is not anti-monotone
- Example. C: range(S.profit) ≤ 15 is anti-monotone
  - Itemset ab violates C
  - So does every superset of ab

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10

Monotonicity for Constraint Pushing
TDB (min_sup = 2): same transactions as on the previous slide.
- Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
  - sum(S.Price) ≥ v is monotone
  - min(S.Price) ≤ v is monotone
- Example.
  C: range(S.profit) ≥ 15
  - Itemset ab satisfies C
  - So does every superset of ab
Item profits: a 40, b 0, c -20, d 10, e -30, f 30, g 20, h -10

Succinctness
- Succinctness: given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
- Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database
  - min(S.Price) ≤ v is succinct
  - sum(S.Price) ≥ v is not succinct

The Apriori Algorithm: Example
Database D
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D, C1 (itemset: sup.): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1 (itemset: sup.): {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (itemset): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D, C2 (itemset: sup.): {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2 (itemset: sup.): {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3 (itemset): {2 3 5}
Scan D, L3 (itemset: sup.): {2 3 5}: 2
Price (five values, as listed on the slide): 1, 1, 1, 1, 10

Naïve Algorithm: Apriori + Constraint
Database D and the Apriori trace (C1 through L3): same as in the Apriori example above.
Price: 1, 1, 1, 1, 10
Constraint: Sum{S.price} < 5

The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
Database D: same as in the Apriori example above.
C1 through L3: same as in the Apriori example above.
Price: 1, 1, 1, 1, 10
Constraint: Sum{S.price} < 5

The Constrained Apriori Algorithm: Push a Succinct Constraint Deep
Database D and the Apriori trace (C1 through L3): same as in the Apriori example above.
Price: 1, 1, 1, 1, 10
Constraint: min{S.price} ≤ 1
Figure annotation: "not immediately to be used"

Converting “Tough” Constraints
- Convert tough constraints into anti-monotone or monotone ones by properly ordering items
- Examine C: avg(S.profit) ≥ 25
  - Order items in value-descending order: <a, f, g, d, b, h, c, e>
  - If an itemset afb violates C, so do afbh and afb*
  - It becomes anti-monotone!
TDB (min_sup = 2): 10: a, b, c, d, f; 20: b, c, d, f, g, h; 30: a, c, d, e, f; 40: c, e, f, g
Item profits: a 40, b 0, c -20, d 10, e -30, f 30, g 20, h -10

Strongly Convertible Constraints
- avg(X) ≥ 25 is convertible anti-monotone w.r.t. item value descending order R: <a, f, g, d, b, h, c, e>
  - If an itemset af violates a constraint C, so does every itemset with af as a prefix, such as afd
- avg(X) ≥ 25 is convertible monotone w.r.t.
  item value ascending order R^-1: <e, c, h, b, d, g, f, a>
  - If an itemset d satisfies a constraint C, so do itemsets df and dfa, which have d as a prefix
- Thus, avg(X) ≥ 25 is strongly convertible
TDB (min_sup = 2): 10: a, b, c, d, f; 20: b, c, d, f, g, h; 30: a, c, d, e, f; 40: c, e, f, g
Item profits: a 40, b 0, c -20, d 10, e -30, f 30, g 20, h -10

Can Apriori Handle Convertible Constraints?
- A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  - Within the level-wise framework, no direct pruning based on the constraint can be made
  - Itemset df violates constraint C: avg(X) ≥ 25
  - Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
- But it can be pushed into the frequent-pattern growth framework!

Handling Multiple Constraints
- Different constraints may require different or even conflicting item orderings
- If there exists an order R s.t. both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
- If there is a conflict on the order of items
  - Try to satisfy one constraint first
  - Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database

What Constraints Are Convertible?
Constraint                                        Convertible anti-monotone  Convertible monotone  Strongly convertible
avg(S) ≤, ≥ v                                     Yes                        Yes                   Yes
median(S) ≤, ≥ v                                  Yes                        Yes                   Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)   Yes                        No                    No
sum(S) ≤ v (items could be of any value, v ≤ 0)   No                         Yes                   No
sum(S) ≥ v (items could be of any value, v ≥ 0)   No                         Yes                   No
sum(S) ≥ v (items could be of any value, v ≤ 0)   Yes                        No                    No
……

Constraint-Based Mining: A General Picture
Constraint                    Antimonotone  Monotone     Succinct
v ∈ S                         no            yes          yes
S ⊇ V                         no            yes          yes
S ⊆ V                         yes           no           yes
min(S) ≤ v                    no            yes          yes
min(S) ≥ v                    yes           no           yes
max(S) ≤ v                    yes           no           yes
max(S) ≥ v                    no            yes          yes
count(S) ≤ v                  yes           no           weakly
count(S) ≥ v                  no            yes          weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)    yes           no           no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)    no            yes          no
range(S) ≤ v                  yes           no           no
range(S) ≥ v                  no            yes          no
avg(S) θ v, θ ∈ {=, ≤, ≥}     convertible   convertible  no
support(S) ≥ ξ                yes           no           no
support(S) ≤ ξ                no            yes          no

A Classification of Constraints
[Figure: diagram of constraint classes: Monotone, Antimonotone, Succinct, Strongly convertible, Convertible anti-monotone, Convertible monotone, Inconvertible]

Frequent-Pattern Mining: Research Problems
- Mining fault-tolerant frequent patterns
  - Patterns allow limited faults (insertion, deletion, mutation)
- Mining truly interesting patterns
  - Surprising, novel, concise, …
- Theoretic foundation of patterns
  - For compressing data?
  - For classification analysis?
- Application exploration
  - Pattern discovery in molecule structures
  - Pattern discovery in bionetworks

Mining Biological Data
[Figure: two layered pyramids, one labeled "increasing potential to support biological discoveries" and one labeled "increasing potential to support business decisions". Layers from bottom to top:
  - Data Sources (DBA). Biological: paper, microarray, bio-molecule structures, mass spectrometry data, … Business: paper, files, web documents, scientific experiments, database systems
  - Data Preprocessing/Integration, Data Warehouses
  - Data Exploration: statistical summary, querying, and reporting (Data Analyst)
  - Data Mining: information discovery
  - Visualization techniques / data presentation (domain expert on the biological side, business analyst on the business side)
  - Hypothesis generation and hypothesis testing on the biological side, decision making on the business side (End User)]

Bio-Data Mining: Classification Schemes
- Different views lead to different classifications
  - Knowledge view: kinds of knowledge to be discovered
  - Data view: kinds of data to be mined
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted

Summary
- Constrained itemset mining and association rules
- The data mining process
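
Worked check: closed and max-patterns
The closed and max-pattern exercise from earlier in the lecture can be checked by brute force. The sketch below is only an illustration: the helper name frequent_itemsets and the exhaustive enumeration are assumptions made here, not the lecture's method, and the approach is only practical for tiny databases like this one.

```python
from itertools import combinations

# Exercise database and threshold from the slides.
DB = [{"a1", "a2"}, {"a1", "a2", "a3"}]
MIN_SUP = 1


def frequent_itemsets(db, min_sup):
    """Enumerate every itemset and keep those with support >= min_sup."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            sup = sum(1 for t in db if set(combo) <= t)
            if sup >= min_sup:
                result[frozenset(combo)] = sup
    return result


freq = frequent_itemsets(DB, MIN_SUP)

# Maximal: frequent, and no proper superset is frequent.
maximal = {x for x in freq if not any(x < y for y in freq)}

# Closed: frequent, and no proper superset has the same support.
# (A superset with equal support is itself frequent, so searching freq suffices.)
closed = {x for x in freq if not any(x < y and freq[y] == freq[x] for y in freq)}

for name, group in (("maximal", maximal), ("closed", closed)):
    print(name, sorted(sorted(x) for x in group))
```

Every maximal itemset is also closed, which is why <a1, a2, a3> appears in both answers while <a1, a2> is closed only.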
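
Worked check: naive filtering vs. pushing an anti-monotone constraint
The difference between the naive "Apriori + constraint" approach and pushing an anti-monotone constraint into Apriori can be sketched on database D from the Apriori example. This is a minimal sketch under stated assumptions: the function names are made up for illustration, and the mapping of the five listed prices (1, 1, 1, 1, 10) to items 1 through 5 in order is an assumption, since the slide text does not say which item each price belongs to.

```python
from itertools import combinations

# Database D and min_sup = 2 from the Apriori example slide.
DB = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
MIN_SUP = 2
# ASSUMPTION: the prices 1, 1, 1, 1, 10 belong to items 1..5 in that order.
PRICE = {1: 1, 2: 1, 3: 1, 4: 1, 5: 10}


def support(itemset):
    return sum(1 for t in DB.values() if itemset <= t)


def satisfies(itemset):
    # Sum{S.price} < 5; anti-monotone because all prices are non-negative.
    return sum(PRICE[i] for i in itemset) < 5


def apriori(push_constraint):
    """Return (qualifying itemsets with support, number of candidates counted)."""
    items = sorted({i for t in DB.values() for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, counted, k = {}, 0, 1
    while candidates:
        if push_constraint:
            # Pushed: drop violating candidates before scanning the database.
            candidates = [c for c in candidates if satisfies(c)]
        counted += len(candidates)
        level = {c for c in candidates if support(c) >= MIN_SUP}
        for c in level:
            frequent[c] = support(c)
        k += 1
        # Join step, then prune candidates with a missing (k-1)-subset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    if not push_constraint:
        # Naive: mine everything first, filter by the constraint afterwards.
        frequent = {c: s for c, s in frequent.items() if satisfies(c)}
    return frequent, counted


naive, n_naive = apriori(push_constraint=False)
pushed, n_pushed = apriori(push_constraint=True)
print(sorted(sorted(c) for c in pushed), n_naive, n_pushed)
```

Under the assumed prices both variants return the same itemsets, but the pushed variant counts support for fewer candidates because item 5 is discarded before the first scan.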
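
Worked check: avg(S.profit) ≥ 25 as a convertible constraint
The claim that avg(S.profit) ≥ 25 becomes anti-monotone once items are listed in value-descending order can be checked exhaustively on the profit table from the slides. The sketch below is an illustration only; the names avg_ok and prefix_antimonotone are hypothetical helpers introduced here.

```python
from itertools import combinations

# Item profits from the slides.
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

# Value-descending order R; matches <a, f, g, d, b, h, c, e> on the slide.
R = sorted(PROFIT, key=PROFIT.get, reverse=True)


def avg_ok(items, v=25):
    """Constraint C: avg(S.profit) >= v."""
    return sum(PROFIT[i] for i in items) / len(items) >= v


def prefix_antimonotone(order):
    """True if, with itemsets written in `order`, a violating prefix
    always implies that the whole itemset violates C."""
    for k in range(1, len(order) + 1):
        # combinations() keeps the relative order of `order`.
        for s in combinations(order, k):
            for p in range(1, k):
                if not avg_ok(s[:p]) and avg_ok(s):
                    return False
    return True


# Plain anti-monotonicity fails: df violates C, yet its superset adf satisfies it.
print(avg_ok("df"), avg_ok("adf"))
# Prefix-based pruning is safe under R, but not under the reverse order.
print(prefix_antimonotone(R), prefix_antimonotone(R[::-1]))
```

The df/adf pair is the same counterexample the lecture uses to show why a level-wise Apriori search cannot prune on this constraint, while a prefix-based (pattern-growth) search following R can.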