Mining Knowledge about
Changes, Differences, and
Trends
Guozhu Dong
Wright State University
Dayton, Ohio
Outline
• Introduction
– Knowledge discovery from databases (KDD)
– Knowledge about changes, differences, & trends
• Contributions
– Changes between datasets (KDD 99 & more)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Introduction -- KDD (1)
• Mountains of data, everywhere!
– Use them → better service, better cures, …
• Aims of KDD
– Mine valid, novel, potentially useful patterns
– Classifiers, clustering, associations, insights, …
• History
– Traditional scientific discovery = manual mining
– Ancestry of KDD: statistics, machine learning, pattern recognition, databases, …
– Field started in the 1990s
• Data forms
– Market basket data (transactions)
– Relational data
– Data cubes (relational + concept hierarchies)
Introduction – KDD (2)
• Main tasks for KDD
– Identifying “useful pattern types”
– Giving algorithms for mining them
– Finding ways for using them
• Our contributions are along these lines
Example knowledge patterns about changes, differences, & trends (CDT)
• Compare dataset A against dataset B, looking for patterns capturing CDT
– Cancer tissues vs normal tissues (gene groups → drug design)
– Loyal customers vs disloyal customers
– Data_1999 vs Data_2000 (emerging trends)
• Compare cells in a data cube, looking for similar cells with big measure differences
– “Gradients”
• Analyze trends in MDML (multidimensional multi-level) manner on a set of time series in a data cube
Traditional approaches to “mining” CDT
• Compare histograms or pie charts of datasets
[Bar chart of quarterly sales (1st Qtr – 4th Qtr) for East, West, and North]
– Gain a little, miss a lot
• Study time series, one or two at a time
• Summaries
• Limitations:
– Only offer a high-level view, on very few “factors/variables”
– But miss knowledge on many factor groups, many insights
Outline
• Introduction
– Knowledge discovery from databases
– Changes, differences, and trends
• Contributions
– Changes between datasets (KDD 99 etc)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Emerging Patterns between Two Datasets
[Side-by-side tables of binned expression values (L/H) for genes g1–g4 in normal tissues vs cancer tissues]
• EP: patterns with a high frequency ratio between the datasets
• E.g. {g1=L, g2=H, g3=L}; freq ratio = infinite
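The frequency-ratio definition above can be sketched directly in code. This is a brute-force illustration only (not the border-based mining algorithm covered later); the `g1=L` item encoding and the toy datasets are hypothetical.

```python
from itertools import combinations

def support(pattern, dataset):
    """Fraction of rows (each a set of items like 'g1=L') containing pattern."""
    return sum(pattern <= row for row in dataset) / len(dataset)

def growth_rate(pattern, d_from, d_to):
    """Frequency ratio of pattern from d_from to d_to; infinite for jumping EPs."""
    s_from, s_to = support(pattern, d_from), support(pattern, d_to)
    if s_from == 0:
        return float("inf") if s_to > 0 else 0.0
    return s_to / s_from

def emerging_patterns(d_from, d_to, min_ratio, max_len=2):
    """Brute-force enumeration of EPs up to max_len items (illustration only)."""
    items = sorted(set().union(*d_to))
    return [frozenset(c)
            for k in range(1, max_len + 1)
            for c in combinations(items, k)
            if growth_rate(frozenset(c), d_from, d_to) >= min_ratio]
```

An EP with a pattern absent from one dataset but present in the other gets an infinite ratio, matching the slide's example.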
Colon tumor gene expression
• 40 tumor, 22 normal colon tissue samples
• 6500 genes/sample (Affymetrix Hum6000 micro-array gene chip)
• Original GE data: 100s of samples × 1000s of dimensions, e.g.

g1  g2  g3  g4
20  90  25  80
24  95  23  28
80  20  25  85
25  89  85  25

• Last slide: binned data
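Turning raw expression values into the binned L/H items of the previous slide can be done, for instance, with a per-gene median split. The actual discretization used in the papers may differ; this is an illustrative sketch only.

```python
def bin_by_median(samples):
    """Discretize each gene's expression into 'L'/'H' items by a per-gene
    median split.  samples: list of dicts mapping gene name -> value.
    Returns one item set per sample, e.g. {'g1=L', 'g2=H'}."""
    genes = list(samples[0])
    medians = {}
    for g in genes:
        vals = sorted(s[g] for s in samples)
        n = len(vals)
        medians[g] = (vals[(n - 1) // 2] + vals[n // 2]) / 2
    return [{f"{g}={'H' if s[g] > medians[g] else 'L'}" for g in genes}
            for s in samples]
```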
Top minimal EPs w/ infinite freq ratio

Normal EPs                   Freq. in normal
{25 33 37 41 43 57 59 69}    77.3%
{25 33 37 41 43 47 57 69}    77.3%
{29 33 35 37 41 43 57 69}    77.3%
{29 33 37 41 43 47 57 69}    77.3%
{6 43 57}                    77.3%
{6 47 57}                    77.3%
{6 57 69}                    77.3%
……

Cancer EPs    Freq. in cancer
{2 10}        70%
{3 10}        67.5%
{10 20}       67.5%
{10 21}       67.5%
{21 58}       65%
{15 40 56}    62.5%
{21 40 56}    62.5%

Minimal EPs with infinite ratio (jumping EPs): all their proper subsets occur in both classes of tissues
Papers using EP techniques appeared in Cancer Cell (cover, 3/02) & in Bioinformatics
EP Types of Particular Interest (1)
• Minimal jumping EPs for normal tissues
– Properly expressed gene groups important for normal cell functioning, but destroyed in all colon cancer tissues
– Restore these → ?cure colon cancer?
• Minimal jumping EPs for cancer tissues
– Bad gene groups that occur in some cancer tissues but never occur in normal tissues
– Disrupt these → ?cure colon cancer?
• ? Possible targets for drug design ?
• Good for classification (later)!
EP Types of Particular Interest (2)
• Emerging trends in timestamped DBs
– E.g. enrollment of US students in major Canadian univ’s increased by 86% during 99–02, to 5000
– This was news in US papers (Oct 02)
– Perhaps an opportunity for Canadian universities
• Note: Dominating trends → not opportunities (either you have won or you are out)
Related work
• Classification/discriminant rules
– We’re not limited to classification/high level rules
• Association rules
– We are more tightly coupled with objectives of
application (divide data into “good” and “bad”)
• Changes in models of datasets
– Only compare fitted decision trees
• Other work usually assumes frequency
threshold; we may not
EP Mining Algorithms
• Border-based approach (KDD 99)
– Produces border descriptions of desired collections of EPs (structured & concise)
– Manipulates borders to get the answer
• Constraint-based approach (KDD 00)
– Look ahead, bound, prune
• Tree-based approach (Bailey et al, 01)
– Organize data in a tree manner to encourage sharing/reducing work
• Still room for improvement (high dimensions)
Borders describe large collections
• <{12,13}, {12345,12456}>
– L (min): 12, 13
– R (max): 12345, 12456
– Represents every set X with S ⊆ X ⊆ T for some S in L and some T in R
– E.g. 123, 124, 125, 126, 134, 135, 1234, 1235, 1245, 1246, 1256, 1345 (= {1,3,4,5}), …
Border-Diff: Effect
• <{{}},{1234}> − <{{}},{34,24,23}> = <{1,234},{1234}>
– [Subset lattice of 1234: {}; 1, 2, 3, 4; 12, 13, 14, 23, 24, 34; 123, 124, 134, 234; 1234 — don’t expand the collections]
• Similar to: [1,100] − [1,50] = (50,100]
• Good for: jumping EPs; EPs in rectangle regions, …
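The Border-Diff result above can be checked with a brute-force sketch: the left bound of the difference is the set of minimal subsets of the universe not covered by any right-bound set. The real Border-Diff manipulates borders directly instead of enumerating subsets; this version is for illustration only.

```python
from itertools import combinations

def border_diff(universe, right_bounds):
    """Minimal subsets of `universe` not contained in any set of `right_bounds`,
    i.e. the left bound of <{{}},{universe}> - <{{}},right_bounds>.
    Brute force for illustration; real Border-Diff avoids enumeration."""
    rights = [frozenset(r) for r in right_bounds]
    minimal = []
    elems = sorted(universe)
    for k in range(1, len(elems) + 1):          # ascending size => minimality
        for combo in combinations(elems, k):
            s = frozenset(combo)
            if any(s <= r for r in rights):     # still covered: not in the diff
                continue
            if any(m < s for m in minimal):     # a smaller diff set inside s
                continue
            minimal.append(s)
    return set(minimal)
```

On the slide's example this yields {1} and {234}, matching <{1,234},{1234}>.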
EP-based Classification
• Classification by aggregating power of EPs

Normal EPs          Freq. in normal
{25 33 37 41 43}    80%
{25 33 37 41 63}    77.3%
{29 33 35 37 41}    77.3%
{6 43 67}           77.3%
{6 47 77}           77.3%
{6 57 69}           60%

Cancer EPs          Freq. in cancer
{2 10}              70%
{3 10}              67.5%
{10 20}             67.5%
{21 58}             65%
{15 40 56}          62.5%
{21 40 56}          62.5%

• T = {2 6 10 25 33 37 41 43 47 57 69}
– Normal score(T) = 0.8 + 0.6 = 1.4
– Cancer score(T) = 0.7
– Class(T) = Normal
– May also normalize scores …
• We gave several proposals since 1999
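The score aggregation above is a one-line sum over each class's EP table (CAEP-style; the normalization mentioned on the slide is omitted here).

```python
def classify(instance, eps_by_class):
    """CAEP-style scoring: sum the frequencies of each class's EPs that are
    contained in the instance, then predict the highest-scoring class.
    eps_by_class: {class_name: [(pattern_set, frequency), ...]}."""
    scores = {cls: sum(freq for pattern, freq in eps if pattern <= instance)
              for cls, eps in eps_by_class.items()}
    return max(scores, key=scores.get), scores
```

Running it on the slide's tables reproduces Normal score 1.4 vs Cancer score 0.7.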
EP-based Classification
• Very high accuracy: Outperforms best of five
other classifiers in 2/3 of 30 UCI datasets
• Outperforms SVM on gene expression data
• Variants
– Using different subsets of selected EPs
– Perhaps instance-driven for EP discovery
and score computation
Why EP-based classifiers are good
• Use discriminating power of low support
EPs, together with high support ones
• Use multi-feature conditions, not just
single-feature conditions
• Select from larger pools of discriminative
conditions
– Compare: The search space of patterns for
decision trees is limited by early choices.
• Combine power of a diversified committee
of “experts” (EPs)
• Decision is highly understandable
Outline
• Introduction
– Knowledge discovery from databases
– Changes, differences, and trends
• Contributions
– Changes between datasets (KDD 99 & more)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Decision support in data cubes
• Used for learning from consolidated historical data: anomalies, unusual factor combinations (Wal-Mart success story; initial idea: Codd et al 93)
• Focus on modeling & analysis of data for decision makers, not daily operations
• Data organized around major subjects or factors, such as customer, product, time, sales
• Contain huge numbers of summaries at different levels of detail
• OLAP operators provided for data analysis
Data Cubes -- Base Cells
• Sales volume (measure) as a function of product, time, and location (dimensions)
• Hierarchical summarization paths:
– Product: Industry → Category → Product
– Location: Region → Country → City → Office
– Time: Year → Quarter → Month/Week → Day
• Base cells sit at the finest level of each path
Data Cubes: Derived Cells
[3-D cube over Time (1Qtr–4Qtr, sum), Location (U.S.A, Canada, Mexico), and Product (TV, PC, VCR, sum); e.g. the cell (TV, *, Mexico)]
• Aggregates: sum, count, avg, max, min, …
• Derived cells offer different levels of detail
Gradient problem
• Find pairs of similar cells (conditions) having big changes in measure values
– Q: Find pairs of similar conditions having big changes in total sale price
– A: Sales of trucks in West went down 20% from 99 to 00; sales of (SUVs, East, June01) are 10% higher than (SUVs, West, June01) ……
• Similar cells: ancestor/descendant pairs, sibling pairs
• Considered by Imielinski et al as the Cubegrade Problem
• No constraint → costly (see next slide)
Huge Space of Cuboids and Cells
[Cuboid lattice, coarse to fine: *** at the top; A**, *B*, **C; AB*, A*C, *BC; ABC at the bottom]
• Each node is a cuboid; *: ALL
• Each cuboid represents a set of cells
• Cuboids (and cells) form lattices
Constrained Gradient Mining
• Csig: (cnt ≥ 100)
• Cprb: (city=“Van”, cust_grp=“busi”, prod_grp=“*”)
• Cgrad(cg, cp): (avg_price(cg) / avg_price(cp) ≥ 1.3)
• (c4, c2) satisfies Cgrad!

Dimensions and measures:

cid  Yr  City  Cst_grp  Prd_grp  Cnt    Avg_price
c1   00  Van   Busi     PC       300    2100
c2   *   Van   Busi     PC       2800   1800
c3   *   Tor   Busi     PC       7900   2350
c4   *   *     busi     PC       58600  2250

• c4 is an ancestor of c1, c2, c3; c2 and c3 are siblings
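The constrained-gradient definition can be sketched as a brute-force check over materialized cells (the LiveSet-driven algorithm on the following slides exists precisely to avoid this all-pairs search). The similarity test here is a simplification: two cells count as similar when every dimension matches or is '*' in one of them, which folds together the ancestor/descendant and sibling cases. The ratio threshold in the test is illustrative.

```python
def gradient_pairs(cells, min_cnt, min_ratio):
    """Brute-force constrained-gradient check.  cells maps a cell id to
    (dims, cnt, avg_price), with '*' meaning ALL in a dimension.  Two cells
    are treated as similar when every dimension agrees or is '*' in one of
    them -- a simplification of the ancestor/descendant/sibling relations."""
    def similar(a, b):
        return all(x == y or x == "*" or y == "*" for x, y in zip(a, b))
    return [(g, p)
            for g, (gd, gcnt, gavg) in cells.items()
            for p, (pd, pcnt, pavg) in cells.items()
            if g != p and similar(gd, pd)
            and gcnt >= min_cnt and pcnt >= min_cnt
            and gavg / pavg >= min_ratio]
```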
LiveSet-Driven Algorithm -- Main Idea
• Compute iceberg of probe cells P using Csig & Cprb
• Use P and Cgrad to find gradients
– Traverse gradient cells in coarse-to-fine manner, using iceberg H-cubing (SIGMOD 01)
– Deal with all potential probe cells in one traversal (as live set of probe cells)
– Dynamically prune live set during traversal
LiveSet
• LiveSet(c): set of probe cells cp that may form a gradient–probe pair w/ some descendant of current cell c
– View current cell as a “set of potential gradient cells”
• Csig: cnt ≥ 100
• Cgrad(cg, cp): (cnt(cg) / cnt(cp) ≥ 2)

cid  Yr  City  Cstgrp  Prdgrp  Cnt   Avgprice
p1   00  Van   Edu     PC      100   1500
p2   99  Tor   *       PC      4000  1800
p3   *   Mon   Busi    PC      1500  8000
p4   *   Edm   *       Ski     2000  10000
p5   *   Whi   *       Ski     1000  10050

• p1, …, p5: global probe cells
• Current cell c1 = (*, *, Edu, *), cnt = 800
• LiveSet(c1) = {p2, p4}
2-Way Pruning of Gradient Cells and Probe Cells Using LiveSet
• Prune current gradient cell c if LiveSet(c) = {}
• Prune probe cells cp if cp can be ignored in searching c’s descendants
– Use min-max boundary check: if the constraint is cnt(cg)/cnt(cp) ≥ 2 and the cnt values in the live set are 10, 18, 32, … (min(cnt) = 10), then 19/10 < 2 → gradient cells w/ cnt ≤ 19 can be pruned
• Handle non-anti-monotone constraints, using weaker constraints for pruning (SIGMOD 01)
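The min-max boundary check above amounts to a tiny predicate:

```python
def can_prune_gradient(cnt_g, liveset_cnts, min_ratio=2):
    """Min-max boundary check: a gradient cell whose count is cnt_g (an upper
    bound for its descendants under cnt) can be pruned when even the smallest
    probe count in the live set cannot reach the required ratio."""
    return cnt_g / min(liveset_cnts) < min_ratio
```

With live-set counts 10, 18, 32 this prunes any gradient cell with cnt ≤ 19, as on the slide.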
Pruning Probe Cells by Dimension Matching Analysis
• Derive LiveSet of child c2 from LiveSet of parent c1
– Since LiveSet(c2) ⊆ LiveSet(c1)
• Discard probe cells in LiveSet(c2) that are unmatchable with c2

cid  Yr  City  Cst_grp  Prd_grp  Cnt   Avg_price  # of mismatches (with c3)
p1   00  Van   Edu      PC       100   1500       1
p2   99  Tor   *        PC       4000  1800       1
p3   *   Mon   Busi     PC       1500  8000       1, 1*

• LiveSet(c1) = {p1, p2, p3}, c1 = (00, Tor, *, *)
• LiveSet(c2) = {p1, p2}, c2 = (00, Tor, *, PC)
An efficient H-cubing method using H-tree
• H-tree: efficient way to organize data, & to promote sharing/reuse of computation
[Diagram: header table of attribute values (Edu, Hhd, Bus, …, Jan, Feb, …, Tor, Van, Mon, …) with sum/cnt aggregates (e.g. Sum: 2285) and side-links into an H-tree rooted at root, with branches such as Edu → Jan → Tor (Sum: 1765, Cnt: 2) and auxiliary info (bins) at the nodes]
H-cubing: Computing Cells Involving Dimension City
[Diagram: a header table HTor collects, via side-links, all H-tree branches passing through Tor, enabling computation of cells such as from (*, *, Tor) to (*, Jan, Tor)]
Scalability on Number of Probe Cells
Scalability on Gradient Threshold
Scalability on Significance Threshold
Scalability on Number of Tuples
[experimental figures]
Outline
• Introduction
– Knowledge discovery from databases
– Changes, differences, and trends
• Contributions
– Changes between datasets (KDD 99 & more)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Multi-Dimensional Trends Analysis of Sets of Time-Series -- Overview
• Consider applications having many time series
– Stocks, power grids, sensor nets, internet, gene expressions for toxicology, …
• Needs for MDML trends analysis
– Mining/monitoring unusual patterns/events, in MDML manner
• Regression cube for time series
– Store regression base cube
– Support MDML OLAP of regressions
• Results also useful for MDML data stream monitoring
Why MDML trends analysis
• Many time series
– E.g. prices of 10000s of stocks; one time series per stock
• Objectives
– Understand behavior of stocks/stock groups
– Find patterns of stock groups
– Monitor unusual events
– Find “groups of stocks” -- variables -- with interesting patterns (MDML search)
Regression based trends analysis
• A time series: (ti, zi), i = 1..n
• Linear regression model is a linear fitting curve z = a0 + a1·t, with least square error
• Can generalize regression to z = a0 + a1·f1(t) + a2·f2(t) + … + ak·fk(t), where each fi is a fixed function of t
• Common tool for trends analysis
• But limited to situations where the “variables” (groups of time series) are known
Regression cube for time series
• There is one initial time series per base cell
• Too costly to fully store all time series
• Regression base cube
– Only store regression parameters of base cells (4 values vs 10000s)
– Can we support MDML OLAP of regressions, using only the regression base cube, in a lossless manner?
• Answer is yes, for “roll up” both on standard dimensions and on the time dimension
Aggregation in Standard Dimensions
• Two component cells → aggregated cell
• We can derive the regression of the aggregated cell from the regression parameters of the component cells
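For aggregation along a standard dimension, the component cells share the same time ticks and the aggregated measure is their sum at each tick. Least squares is linear in z, so the fitted parameters of the aggregated cell are simply the sums of the component parameters; a sketch with simple linear regression:

```python
def linfit(ts, zs):
    """Least-squares line z = a0 + a1*t via the normal equations."""
    n, st = len(ts), sum(ts)
    stt = sum(t * t for t in ts)
    sz = sum(zs)
    stz = sum(t * z for t, z in zip(ts, zs))
    a1 = (n * stz - st * sz) / (n * stt - st * st)
    return (sz - a1 * st) / n, a1

def aggregate_std_dim(p1, p2):
    """Component cells share time ticks and the measure is summed, so the
    least-squares parameters of the aggregated cell are just the sums."""
    return p1[0] + p2[0], p1[1] + p2[1]
```

This is what makes the aggregation lossless: no raw series is needed, only the stored parameters.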
Aggregation in Time Dimension
• Cells of 2 adjacent time intervals → aggregated cell
• We can derive the regression of the aggregated cell from the regression parameters of the component cells
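For aggregation along the time dimension, keeping a small fixed set of sufficient statistics per interval makes concatenating adjacent intervals lossless. This is one simple realization of the idea; the compressed per-cell representation used in the VLDB 02 paper differs in detail.

```python
def suffstats(ts, zs):
    """(n, sum t, sum t^2, sum z, sum t*z): enough to recover the linear fit,
    so a cell need not keep its raw time series."""
    return (len(ts), sum(ts), sum(t * t for t in ts), sum(zs),
            sum(t * z for t, z in zip(ts, zs)))

def merge(s1, s2):
    """Aggregating two adjacent time intervals = adding their statistics."""
    return tuple(a + b for a, b in zip(s1, s2))

def fit_from_stats(s):
    """Solve the normal equations for z = a0 + a1*t from the statistics."""
    n, st, stt, sz, stz = s
    a1 = (n * stz - st * sz) / (n * stt - st * st)
    return (sz - a1 * st) / n, a1
```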
Remarks on Regression Cube
• Efficient storage; scalable (independent of number of tuples in data cells)
• Lossless aggregation without accessing raw data
• Fast and efficient aggregation
• Regression models of data cells at all levels
• Results cover a large and popular class of regressions (linear, polynomial, and other models)
Concluding remarks
• Mining knowledge about changes, differences, & trends (CDT) is useful & exciting
• Traditional approaches focus on a high-level view
• We considered CDT mining in transactions, relations, & data cubes
• We used discovered CDT patterns for classification, niche mining, & bioinformatics & medical studies
• Future work: mining useful CDT knowledge for bioinformatics, bio-medicine, business, …
References: Changes,
Differences, & Trends
• S. D. Bay and M. J. Pazzani. Detecting group differences: Mining
contrast sets. Data Mining and Knowledge Discovery, 2001.
• Y. Cai, N. Cercone, and J. Han. Attribute-oriented induction in
relational databases. In Knowledge Discovery in Databases,
AAAI/MIT Press, 1991.
• G. Dong and K. Deshpande. Efficient mining of niches and set
routines. In Pacific-Asia Conf. On Knowledge Discovery & Data Mining,
2001.
• G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. In Proc. of the 5th ACM SIGKDD Int'l Conf. On
Knowledge Discovery and Data Mining, 1999.
• G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by
aggregating emerging patterns. In Proc. 2nd Int'l Conf. on Discovery
Science, Tokyo, 1999.
• V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Y. Loh. A framework for
measuring changes in data characteristics. In PODS, 1999.
• J. Li, G. Dong, and K. Ramamohanarao. Instance-based classification
by emerging patterns. In European Conf. of Principles and Practice of
Knowledge Discovery in Databases, Lyon, France, 2000.
References: Changes, Difference
and Trends (Cont’d)
• J. Li, G. Dong, K. Ramamohanarao. Making use of the most expressive
jumping emerging patterns for classification. In Proc Pacific Asia Conf.
on Knowledge Discovery & Data Mining, 2000.
• J. Li, K. Ramamohanarao, G. Dong. Combining the strength of pattern
frequency and distance for classification. In Pacific-Asia KDD, 2001.
• J. Li, L. Wong. Identifying good diagnostic genes or genes groups from
gene expression data by using the concept of emerging patterns.
Bioinformatics. 18:725--734, 2002.
• Bing Liu, Wynne Hsu, Heng-Siew Han, and Yiyuan Xia. Mining changes
for real-life applications. In DaWaK, 2000.
• Bing Liu, Wynne Hsu, and Yiming Ma. Discovering the set of
fundamental rule changes. In KDD, 2001.
• Eng-Juh Yeoh, …, Jinyan Li, …,Limsoon Wong, James R. Downing.
Classification, subtype discovery, and prediction of outcome in
pediatric acute lymphoblastic leukemia by gene expression profiling.
Cancer Cell, 1:133—143, March 2002.
• X. Zhang, G. Dong, K. Ramamohanarao. Exploring constraints to
efficiently mine emerging patterns from large high-dimensional
datasets. In KDD, 2000.
References: Changes and Trends
(Data Cubes)
• S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R.
Ramakrishnan, and S. Sarawagi. On the computation of
multidimensional aggregates. VLDB'96.
• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and
iceberg cubes. SIGMOD'99.
• S. Chaudhuri and U. Dayal. An overview of data warehousing and
OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.
• Y. Chen, G. Dong, J. Han, B. W. Wah, J. Wang. Multi-Dimensional
Regression Analysis of Time-Series Data Streams. VLDB 2002.
• E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (on-line
analytical processing) to user-analysts: an IT mandate. Tech Report,
Codd Associates, 1993.
• G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-Dimensional
Constrained Gradients in Data Cubes. VLDB 2001.
• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D.
Ullman. Computing iceberg queries efficiently. VLDB'98.
References: Changes and Trends (Data Cubes) (Cont’d)
• J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
• J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. SIGMOD'01.
• V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD'96.
• T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Tech Report, Computer Science, Rutgers Univ, Aug. 2000.
• L. V. S. Lakshmanan, J. Pei, J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB 2002.
• K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB'97.
• S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
• Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.
Extra Slides
• Just in case …
Base Cells: Tuples of a Relation

Product  Location   Time      Sale
Printer  Manhattan  Jan 1999  100K
Laptop   Queens     Jan 1999  800K
…        …          …         …
Data Cubes: OLAP OPs
• Rollup, drilldown, pivot, slice/dice
[3-D cube over Time (1Qtr–4Qtr, sum), Location (U.S.A, Canada, Mexico), and Product (TV, PC, VCR, sum)]
Experimental Results
• Constraints:
– Csig is on cnt
– Cprb selects a set of cells
– Cgrad(cg, cp): (avg_price(cg) / avg_price(cp) ≥ s)
• Data set
– 10 dimensions
– 10k–20k tuples
– Cardinality 10 for each dimension
– Measure range: 100–1000
• All-Pairs: one independent search per probe cell