Mining for Interesting Queries

Mining for Queries
Florin Rusu
Mentors: Vijayshankar Raman, Lin Qiao, Peter Haas
Manager: Guy Lohman

Motivation
- Hardware trends
  - Multi-core processors
  - Memory capacity
- Query execution improvement [RSQ08]
  - Parallel, in-memory databases
  - Compression, scans
- Goal
  - Use the query power

Exploratory Analysis
- Hypothesis-driven
  - Online aggregation [HHW97]
  - Data cube exploration
- Discovery-driven
  - Materialized data cube [SAM98]
- Goals
  - Automatic exploratory analysis

Cell Phone Call Data Warehouse
Table Calls

Latitude  Longitude  Call Time  Call Duration  Call Type      Call Status
38N       123W       9:20       10:03          International  OK
38N       122W       20:48      20:20          Long           Drop
38N       120W       16:22      0:48           Local          OK
…         …          …          …              …              …
38N       70W        12:29      0:32           Cell           Drop

Interesting Query (1)
What are the time intervals for which the duration of long distance calls is significantly high?
[Figure: one bar chart per time interval (0:00-5:59, 6:00-11:59, 12:00-17:59, 16:00-21:59, 18:00-23:59), each comparing the call types Local, Long, International and Cell]

Interesting Query (2)
What are the areas with large fractions of dropped calls?
- Non-axes-aligned hyper-rectangles

Query Pattern
Example:
  SELECT SUM(Call Duration)
  FROM Calls
  WHERE (16:00 < Call Time < 22:00) AND
        (10 < 0.65*Latitude - 0.35*Longitude < 100)
  GROUP BY Call Type
General pattern:
  SELECT AGG(G)
  FROM T
  WHERE L1 < P1 < U1 AND … AND Ln < Pn < Un
  GROUP BY G

Problem
- Given a data warehouse, find the most interesting regions in the attribute space (P1,…,Pn) according to a function F, restricted to regions that have sufficient support
- This is a search problem over hyper-rectangular regions in the attribute space:
    argmax over (P1,…,Pn) of F(P1,…,Pn,G), subject to COUNT(P1,…,Pn) > S
- Function F
  - Value of one group / average value of the other groups

Exhaustive Search
- Try all axes-aligned hyper-rectangles
  - 10 attributes with 10 values in the domain give 10^20 hyper-rectangles
- Alternatives
  - Run independent queries
  - Pre-computed prefix-sum array [HAMS97]
- Top-down search
  - Iceberg pruning on support [BR99]

Incremental Approach
- Solve for each group separately and combine the results
- F is assumed locally smooth
- Bottom-up search
  - Find local maxima points for F
  - Extend region around local maxima
  - Verify function F
  - Verify support

Axes-Aligned
[Figure sequence, one step per slide]
- Find local maxima points for F
- For each local maximum
- Extend along axes
- Maximal region

Principal Component Analysis
- Find a new basis whose vectors correspond to the directions of variance in the original data
- Steps
  - Mean-center data along each dimension
  - Compute covariance matrix
  - Find eigen-vectors and eigen-values

Non-Axes-Aligned
[Figure sequence, one step per slide]
- Find local maxima points for F
- For each local maximum
- Extend along eigen-vectors computed by PCA
- Maximal region

Query Parameters
- Aggregate fraction
  - AGG(g) / ΣAGG(G)
- Support
  - COUNT(g)
- Density
  - COUNT DISTINCT(g) / TOTAL POINTS

Implementation
- Use Blink [RSQ08] as query engine
- Use queries extensively
  - Function evaluation
  - Region expansion
- Overlap multiple queries

Questions
Comments
Suggestions
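
The bottom-up search from the Incremental Approach and Axes-Aligned slides (find local maxima of F, extend the region along the axes, verify F, verify support) can be illustrated with a small sketch. This is not the implementation from the talk, which issues its queries against Blink. It is a toy version under stated assumptions: the attribute space is a 2-D grid of pre-aggregated cells, F is the value of one group divided by the average value of the other groups, support is the total row count in the region, and every name (region_stats, local_maxima, extend_region, mine, min_f, min_support) is hypothetical.

```python
from itertools import product

# Assumption: grid[r][c] maps each group name to (aggregate value, row count)
# for one cell of a 2-D attribute space.

def region_stats(grid, lo, hi, group):
    """Return (F, support) for the inclusive cell rectangle lo..hi."""
    totals, support = {}, 0
    for r, c in product(range(lo[0], hi[0] + 1), range(lo[1], hi[1] + 1)):
        for g, (agg, count) in grid[r][c].items():
            totals[g] = totals.get(g, 0.0) + agg
            support += count
    others = [v for g, v in totals.items() if g != group]
    avg_others = sum(others) / len(others) if others else 0.0
    target = totals.get(group, 0.0)
    if avg_others == 0:
        f = float("inf") if target > 0 else 0.0
    else:
        f = target / avg_others  # value of one group / average of the others
    return f, support

def local_maxima(grid, group):
    """Cells whose F is strictly greater than that of every 4-neighbour."""
    rows, cols = len(grid), len(grid[0])
    f = [[region_stats(grid, (r, c), (r, c), group)[0] for c in range(cols)]
         for r in range(rows)]
    seeds = []
    for r, c in product(range(rows), range(cols)):
        neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        if all(f[r][c] > f[nr][nc] for nr, nc in neighbours
               if 0 <= nr < rows and 0 <= nc < cols):
            seeds.append((r, c))
    return seeds

def extend_region(grid, seed, group, min_f):
    """Greedily grow the rectangle one cell at a time along each axis."""
    rows, cols = len(grid), len(grid[0])
    lo, hi = seed, seed
    grew = True
    while grew:
        grew = False
        # Try up, down, left, right; keep a step only if F stays >= min_f.
        for d_lo, d_hi in (((-1, 0), (0, 0)), ((0, 0), (1, 0)),
                           ((0, -1), (0, 0)), ((0, 0), (0, 1))):
            new_lo = (lo[0] + d_lo[0], lo[1] + d_lo[1])
            new_hi = (hi[0] + d_hi[0], hi[1] + d_hi[1])
            if (new_lo[0] < 0 or new_lo[1] < 0
                    or new_hi[0] >= rows or new_hi[1] >= cols):
                continue
            if region_stats(grid, new_lo, new_hi, group)[0] >= min_f:
                lo, hi, grew = new_lo, new_hi, True
    return lo, hi

def mine(grid, group, min_f, min_support):
    """Seed at local maxima, extend, then verify F and support."""
    results = []
    for seed in local_maxima(grid, group):
        if region_stats(grid, seed, seed, group)[0] < min_f:
            continue
        lo, hi = extend_region(grid, seed, group, min_f)
        f, support = region_stats(grid, lo, hi, group)
        if f >= min_f and support > min_support:
            results.append((lo, hi, f, support))
    return results

# Made-up toy data: SUM(duration) of "Long" calls per cell; the other two
# groups are flat at 10.0, and every group has 5 rows in every cell.
long_sums = [[10, 10, 10, 10],
             [10, 50, 40, 10],
             [10, 40, 30, 10],
             [10, 10, 10, 10]]
grid = [[{"Long": (float(v), 5), "Local": (10.0, 5), "Cell": (10.0, 5)}
         for v in row] for row in long_sums]
regions = mine(grid, "Long", min_f=3.5, min_support=50)
print(regions)
```

On this toy grid the only seed is cell (1, 1), and the sketch grows it into the 2x2 block from (1, 1) to (2, 2), where F = 160 / 40 = 4.0 and the support is 60 rows. The greedy one-step extension is only one possible reading of "extend region around local maxima". It relies on the slide's assumption that F is locally smooth, and it does not guarantee a globally maximal region.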
