15-721 Database Management Systems
Data Mining - Data Cubes
Anastassia Ailamaki, http://www.cs.cmu.edu/~natassa
Substitute lecturer: Christos Faloutsos, [email protected], www.cs.cmu.edu/~christos

Outline
• Problem
• Getting the data: Data Warehouses, DataCubes, OLAP
• Supervised learning: decision trees
• Unsupervised learning
  – association rules
  – (clustering)

Problem
Given: multiple data sources (e.g., branches in PGH, NY, SF), with relations such as
  sales(p-id, c-id, date, $price)
  customers(c-id, age, income, ...)
Find: patterns (classifiers, rules, clusters, outliers, ...)

Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
• How? A: triggers / materialized views
• How often? A: [art!]
• How about discrepancies / non-homogeneities? A: wrappers / mediators

Data Warehousing
Step 2: collect counts (DataCubes / OLAP). E.g.:

OLAP
Problem: "is it true that shirts in large sizes sell better in dark colors?"

sales relation:
  c-id  p-id   Size  Color  $
  C10   Shirt  L     Blue   30
  C10   Pants  XL    Red    50
  C20   Shirt  XL    White  20
  ...

counts of shirts, broken down by Color and Size (C/S):
  C/S    S    M    L    TOT
  Red    20   3    5    28
  Blue   3    3    8    14
  Gray   0    0    5    5
  TOT    23   6    18   47

DataCubes
• 'color', 'size': DIMENSIONS
• 'count': MEASURE
The cross-tab above - the count f(color, size) together with all its marginal totals - is the DataCube.

DataCubes
SQL query to generate the DataCube:
• Naively (and painfully): one GROUP BY per subset of the dimensions:

  select size, color, count(*)
  from sales
  where p-id = 'shirt'
  group by size, color

  select size, count(*)
  from sales
  where p-id = 'shirt'
  group by size
  ...

• with the 'cube by' keyword, a single query:

  select size, color, count(*)
  from sales
  where p-id = 'shirt'
  cube by size, color
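To make the "one GROUP BY per subset of the dimensions" idea concrete, here is a minimal Python sketch (not from the slides) that enumerates all 2^d aggregates of a toy shirt-sales list; the row layout, the 'ALL' placeholder, and the function name cube are illustrative assumptions.

from itertools import combinations
from collections import Counter

# toy (color, size) rows for shirts, matching the cross-tab's totals
rows = ([("Red", "S")] * 20 + [("Red", "M")] * 3 + [("Red", "L")] * 5 +
        [("Blue", "S")] * 3 + [("Blue", "M")] * 3 + [("Blue", "L")] * 8 +
        [("Gray", "L")] * 5)

N_DIMS = 2  # color, size

def cube(rows):
    """Naive DataCube: one counting pass per subset of the dimensions."""
    result = {}
    for k in range(N_DIMS + 1):
        for subset in combinations(range(N_DIMS), k):
            # project each row onto the chosen dimensions; 'ALL' elsewhere
            counts = Counter(
                tuple(row[i] if i in subset else "ALL" for i in range(N_DIMS))
                for row in rows)
            result.update(counts)
    return result

for cell, count in sorted(cube(rows).items()):
    print(cell, count)   # e.g. ('ALL', 'ALL') 47 ... ('Red', 'ALL') 28 ...

This enumeration is roughly what the 'cube by' keyword (GROUP BY CUBE in later SQL standards) asks the DBMS to do in a single statement, giving it the chance to share work among the aggregates instead of rescanning the data 2^d times.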
DataCubes - issues
Q1: How to store them (and/or materialize portions on demand)? A: ROLAP / MOLAP
Q2: How to index them? A: bitmap indices
Q3: Which operations to allow? A: roll-up, drill-down, slice, dice
[More details: book by Han+Kamber]

DataCubes
Q1: How to store a DataCube?
A1: Relational (R-OLAP): one tuple per cube cell, with 'all' marking the rolled-up dimensions:
  Color   Size   count
  'all'   'all'  47
  Blue    'all'  14
  Blue    M      3
  ...
A2: Multi-dimensional (M-OLAP): store the cross-tab (the array of counts) directly
A3: Hybrid (H-OLAP)

DataCubes - Pros/Cons
ROLAP strong points (DSS, Metacube):
• uses existing RDBMS technology
• scales up better with dimensionality
MOLAP strong points (EssBase / hyperion.com):
• faster indexing (careful with: high dimensionality; sparseness)
HOLAP (MS SQL Server OLAP services):
• detail data in ROLAP; summaries in MOLAP

DataCubes
Q2: What operations should we support?
• Roll-up (aggregate a dimension up to a coarser level, e.g., all sizes together)
• Drill-down (the inverse: go back to finer granularity)
• Slice (fix one dimension to a single value, e.g., Color = 'Red')
• Dice (select a sub-cube by restricting several dimensions)

DataCubes
Q3: How to index a DataCube?
A1: Bitmaps: one bit-vector per distinct value of each dimension (a 'Red' bitmap, a 'Blue' bitmap, an 'S' bitmap, ...), with one bit per tuple
A2: Join indices (see [Han+Kamber])
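To illustrate the bitmap idea, here is a small Python sketch (mine, not the slides' code): one integer per distinct column value serves as the bit-vector, and a conjunctive condition becomes a bitwise AND; the toy rows and the helper name build_bitmaps are assumptions for illustration.

# One bit-vector per distinct value of each indexed column;
# bit i is set iff tuple i has that value.
rows = [  # (color, size) of each sales tuple -- toy data
    ("Blue", "L"), ("Red", "S"), ("Gray", "L"), ("Red", "S"), ("Blue", "M"),
]

def build_bitmaps(rows, col):
    """Return {value: bitmap-as-int} for one column (0 = color, 1 = size)."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[col]] = bitmaps.get(row[col], 0) | (1 << i)
    return bitmaps

color_bm = build_bitmaps(rows, 0)
size_bm = build_bitmaps(rows, 1)

# "how many Blue tuples in size L?"  ->  popcount(Blue AND L)
hits = color_bm["Blue"] & size_bm["L"]
print(bin(hits).count("1"))   # 1 qualifying tuple (row 0)

Counting the set bits of ANDed bitmaps answers such cube queries without touching the base tuples, which is why bitmaps suit the low-cardinality dimensions typical of OLAP.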
D/W - OLAP - Conclusions
• D/W: copy (summarized) data + analyze
• OLAP concepts:
  – DataCube
  – R/M/H-OLAP servers
  – 'dimensions'; 'measures'

Outline
• Problem
• Getting the data: Data Warehouses, DataCubes, OLAP
• Supervised learning: decision trees
  – problem
  – approach
  – scalability enhancements
• Unsupervised learning
  – association rules
  – (clustering)

Decision trees - Problem
Given records with attributes and a class label, e.g.:
  Age  Chol-level  Gender  ...  CLASS-ID
  30   150         M       ...  +
  ...  ...         ...     ...  -
predict the class ('??') of a new, unlabeled record.

Decision trees
• Pictorially, the training records are points in the space of the numerical attributes, labeled '+' or '-'.
[Figure: scatter of '+' and '-' points over num. attr#1 (e.g., 'age') and num. attr#2 (e.g., chol-level)]
• We want to label a new point '?'.
• So we build a decision tree: axis-parallel splits (e.g., at age = 50 and chol-level = 40) carve the space into homogeneous regions.
[Figure: a decision tree whose root tests 'age < 50'; its Y branch leads to a test 'chol. < 40' with '+' and '-' leaves; the N branch continues (...)]

Decision trees
• Typically, two steps:
  – tree building
  – tree pruning (against over-training / over-fitting)

Tree building
• How?
• A: partition, recursively - pseudocode:

  Partition(Dataset S):
      if all points in S have the same label, then return
      evaluate splits along each attribute A
      pick the best split, to divide S into S1 and S2
      Partition(S1); Partition(S2)

• Q1: how to introduce splits along attribute Ai?
• A1:
  – for numerical attributes:
    • binary split, or
    • multiple split
  – for categorical attributes:
    • consider all subsets (expensive!), or
    • use a greedy algorithm
• Q2: how to evaluate a split?
• A: by how close to uniform each subset is - i.e., we need a measure of uniformity:

Tree building - measures of uniformity
• entropy: H(p+, p-); any other measure?
• 'gini' index: 1 - (p+)^2 - (p-)^2
[Figure: both measures plotted against p+ over [0, 1]; both are 0 for pure nodes (p+ = 0 or 1) and maximal at p+ = 0.5]
• (How about multiple labels?)
• Intuition:
  – entropy: #bits to encode the class label
  – gini: classification error, if we randomly guess '+' with probability p+

Tree building
Thus, we choose the split that reduces entropy / classification error the most. E.g.:
[Figure: scatter of 7 '+' and 6 '-' points; a split separates a pure group of 5 '+' from a group with 2 '+' and 6 '-']
• Before the split we need (n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13) bits in total, to encode all the class labels.
• After the split we need 0 bits for the first half and (2+6) * H(2/8, 6/8) bits for the second half.
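The bit counting above is easy to reproduce; below is a minimal Python sketch (the function names are mine, not the lecture's) of the two uniformity measures and of the before/after comparison for exactly this example.

import math

def entropy(p_plus):
    """H(p+, p-) in bits; 0 for a pure node, 1 at p+ = 0.5."""
    bits = 0.0
    for p in (p_plus, 1.0 - p_plus):
        if p > 0:
            bits -= p * math.log2(p)
    return bits

def gini(p_plus):
    """'gini' index 1 - (p+)^2 - (p-)^2; also 0 for a pure node."""
    return 1.0 - p_plus**2 - (1.0 - p_plus)**2

def label_bits(groups):
    """Total bits to encode the class labels, given [(n_plus, n_minus), ...]."""
    return sum((np + nm) * entropy(np / (np + nm)) for np, nm in groups)

# the slides' example: 7 '+' and 6 '-' before the split;
# afterwards a pure group (5 '+') and a mixed group (2 '+', 6 '-')
before = label_bits([(7, 6)])          # (7+6) * H(7/13, 6/13)  ~ 12.9 bits
after  = label_bits([(5, 0), (2, 6)])  # 0 + (2+6) * H(2/8, 6/8) ~ 6.5 bits
print(round(before, 2), round(after, 2), round(gini(7 / 13), 2))

The split with the largest reduction (here roughly 12.9 - 6.5 bits) is the one Partition() would pick at this node.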
Tree pruning
• What for?
[Figure: the same scatter plot, where further splitting would chase individual points (over-fitting)]

Tree pruning
Shortcut for scalability - DYNAMIC pruning:
• stop expanding the tree if a node is 'reasonably' homogeneous
  – ad hoc threshold [Agrawal+, vldb92]
  – Minimum Description Length (MDL) criterion (SLIQ) [Mehta+, edbt96]

Tree pruning
• Q: How to do it?
• A1: use a 'training' and a 'testing' set; prune a node if doing so improves classification on the 'testing' set. (Drawbacks?)
• A2: or rely on MDL (= Minimum Description Length) - in detail:

Tree pruning
• envision the problem as compression (of what?)
• and try to minimize the number of bits to compress (a) the class labels AND (b) the representation of the decision tree
(a toy numeric sketch of this trade-off appears after the references below)

(MDL)
• a brilliant idea - e.g.: the best n-degree polynomial to compress a set of points is the one that minimizes (sum of errors + n)

Scalability enhancements
• Interval Classifier [Agrawal+, vldb92]: dynamic pruning
• SLIQ [Mehta+, edbt96]: dynamic pruning with MDL; vertical partitioning of the file (but the label column has to fit in core)
• SPRINT: even more clever partitioning

Conclusions for classifiers
• Classification through trees
• Building phase: splitting policies
• Pruning phase (to avoid over-fitting)
• For scalability:
  – dynamic pruning
  – clever data partitioning

Overall Conclusions
• Data Mining: of high commercial interest
• DM = DB + ML + Stat
• Data warehousing / OLAP: to get the data
• Tree classifiers (SLIQ, SPRINT)
• Association Rules: the 'a-priori' algorithm
• (clustering: BIRCH, CURE, OPTICS)

Reading material
• Agrawal, R., T. Imielinski and A. Swami, 'Mining Association Rules between Sets of Items in Large Databases', SIGMOD 1993.
• M. Mehta, R. Agrawal and J. Rissanen, 'SLIQ: A Fast Scalable Classifier for Data Mining', Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.

Additional references
• Agrawal, R., S. Ghosh, et al., 'An Interval Classifier for Database Mining Applications', VLDB Conf. Proc., Vancouver, BC, Canada, Aug. 23-27, 1992.
• Jiawei Han and Micheline Kamber, 'Data Mining: Concepts and Techniques', Morgan Kaufmann, 2001, chapters 2.2-2.3, 6.1-6.2, 7.3.5.
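As promised in the pruning slides above, here is a toy Python sketch (mine, not the lecture's) of the MDL trade-off: a split is kept only if the bits it saves on encoding the class labels exceed the bits needed to describe the split itself; the per-split cost constant and the function names are arbitrary illustrative assumptions.

import math

def label_bits(n_plus, n_minus):
    """Bits to encode the class labels of one node: n * H(p+, p-)."""
    n = n_plus + n_minus
    bits = 0.0
    for k in (n_plus, n_minus):
        if 0 < k < n:
            bits -= k * math.log2(k / n)
    return bits

TREE_BITS_PER_SPLIT = 4.0   # cost of describing one split (arbitrary, for illustration)

def keep_split(parent, children):
    """MDL-style test: keep a split only if it pays for its own description."""
    cost_pruned = label_bits(*parent)
    cost_kept = TREE_BITS_PER_SPLIT + sum(label_bits(*c) for c in children)
    return cost_kept < cost_pruned

# the earlier example: (7+, 6-) split into a pure (5+, 0-) and a mixed (2+, 6-)
print(keep_split((7, 6), [(5, 0), (2, 6)]))   # True: the split saves enough bits
# an almost useless split of the same node:
print(keep_split((7, 6), [(4, 3), (3, 3)]))   # False: not worth the extra tree bits

SLIQ's actual MDL cost model is more careful about how it encodes the tree and the classification errors, but the comparison has this general shape: bits(tree) + bits(labels given the tree), minimized over candidate prunings (see [Mehta+, edbt96]).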