Mining for Queries
Florin Rusu
Mentors: Vijayshankar Raman, Lin Qiao, Peter Haas
Manager: Guy Lohman

Motivation
  Hardware trends
    Multi-core processors
    Memory capacity
  Query execution improvement [RSQ08]
    Parallel, in-memory databases
    Compression, scans
  Goal
    Use the query power

Exploratory Analysis
  Hypothesis-driven
    Online aggregation [HHW97]
    Data cube exploration
  Discovery-driven
    Materialized data cube [SAM98]
  Goals
    Automatic exploratory analysis

Cell Phone Call Data Warehouse
  Table Calls

  Latitude  Longitude  Call Time  Call Duration  Call Type      Call Status
  38N       123W       9:20       10:03          International  OK
  38N       122W       20:48      20:20          Long           Drop
  38N       120W       16:22      0:48           Local          OK
  …         …          …          …              …              …
  38N       70W        12:29      0:32           Cell           Drop

Interesting Query (1)
  What are the time intervals for which the duration of long-distance calls is significantly high?
  [Figure: five charts, one per call-time interval (0:00-5:59, 6:00-11:59, 12:00-17:59, 16:00-21:59, 18:00-23:59), each comparing the call types Local, Long, International, Cell]

Interesting Query (2)
  What are the areas with large fractions of dropped calls?
  Non-axis-aligned hyper-rectangles

Query Pattern
  Example
    SELECT SUM(Call Duration)
    FROM Calls
    WHERE (16:00 < Call Time < 22:00)
      AND (10 < 0.65*Latitude-0.35*Longitude < 100)
    GROUP BY Call Type
  General pattern
    SELECT AGG(G)
    FROM T
    WHERE L1 < P1 < U1 AND … AND Ln < Pn < Un
    GROUP BY G

Problem
  Given a data warehouse, find the most interesting regions in the attribute space (P1,…,Pn) according to a function F, restricted to regions that have sufficient support
  Search problem over hyper-rectangular regions in the attribute space
    argmax over (P1,…,Pn) of F(P1,…,Pn,G), subject to COUNT(P1,…,Pn) > S
  Function F
    Value for one group / average value over the other groups

Exhaustive Search
  Try all axis-aligned hyper-rectangles
    10 attributes with 10 values in the domain give 10^20 hyper-rectangles
  Alternatives
    Run independent queries
    Pre-computed prefix-sum array [HAMS97]
    Top-down search
      Iceberg pruning on support [BR99]

Incremental Approach
  Solve for each group separately and combine the results
  F is assumed locally smooth
  Bottom-up search
    Find local maxima points for F
    Extend region around local maxima
      Verify function F
      Verify support

Axis-Aligned
  Find local maxima points for F
  For each local maximum
    Extend along axes
    Maximal region

Principal Component Analysis
  Find a new basis whose vectors follow the directions of greatest variance in the original data
  Steps
    Mean-center data along each dimension
    Compute covariance matrix
    Find eigenvectors and eigenvalues

Non-Axis-Aligned
  Find local maxima points for F
  For each local maximum
    Extend along eigenvectors computed by PCA
    Maximal region

Query Parameters
  Aggregate fraction: AGG(g) / ΣAGG(G)
  Support: COUNT(g)
  Density: COUNT DISTINCT(g) / TOTAL POINTS

Implementation
  Use Blink [RSQ08] as query engine
  Use queries extensively
    Function evaluation
    Region expansion
  Overlap multiple queries

Questions, Comments, Suggestions
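The bottom-up, axis-aligned search from the Incremental Approach slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the Blink-based implementation: the attribute space is assumed to be pre-bucketed into a small 2-D grid, each cell holds a per-group aggregate and a row count, and the names `aggregate`, `score`, `local_maxima`, `grow`, `min_score` and `min_support` are all hypothetical. F is taken as the value of one group divided by the average value of the other groups, as on the Problem slide.

```python
from itertools import product

def aggregate(cells, region):
    """Sum each group's value over all cells of an inclusive rectangle."""
    lo_i, hi_i, lo_j, hi_j = region
    totals = {}
    for i, j in product(range(lo_i, hi_i + 1), range(lo_j, hi_j + 1)):
        for group, value in cells[(i, j)].items():
            totals[group] = totals.get(group, 0) + value
    return totals

def score(totals, group):
    """F: value of one group / average value of the other groups.
    Assumes at least two groups are present."""
    others = [v for g, v in totals.items() if g != group]
    avg = sum(others) / len(others)
    return totals[group] / avg if avg > 0 else float("inf")

def local_maxima(cells, group):
    """Cells whose F is strictly greater than that of every grid neighbour."""
    f = {cell: score(values, group) for cell, values in cells.items()}
    peaks = []
    for (i, j), value in f.items():
        neighbours = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
        if all(value > f[n] for n in neighbours if n in f):
            peaks.append((i, j))
    return sorted(peaks)

def grow(cells, counts, seed, group, min_score, min_support, shape):
    """Extend a rectangle around `seed` one step at a time along each axis,
    keeping a step only while F stays at or above min_score (verify F),
    then check that the final region has enough rows (verify support)."""
    region = [seed[0], seed[0], seed[1], seed[1]]
    changed = True
    while changed:
        changed = False
        for idx, step in ((0, -1), (1, 1), (2, -1), (3, 1)):
            cand = list(region)
            cand[idx] += step
            inside = (0 <= cand[0] and cand[1] < shape[0]
                      and 0 <= cand[2] and cand[3] < shape[1])
            if inside and score(aggregate(cells, cand), group) >= min_score:
                region, changed = cand, True
    lo_i, hi_i, lo_j, hi_j = region
    support = sum(counts[c] for c in product(range(lo_i, hi_i + 1),
                                             range(lo_j, hi_j + 1)))
    return tuple(region) if support > min_support else None

# Made-up 4x4 grid: "Long" calls are elevated in the 2x2 block (1..2, 1..2).
shape = (4, 4)
cells = {c: {"Local": 10, "Long": 10, "Cell": 10}
         for c in product(range(4), range(4))}
for c in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    cells[c]["Long"] = 30
cells[(1, 1)]["Long"] = 40
counts = {c: 5 for c in cells}

peaks = local_maxima(cells, "Long")
region = grow(cells, counts, peaks[0], "Long",
              min_score=2.9, min_support=10, shape=shape)
```

In this made-up data the only local maximum is cell (1, 1), and the region grows to the rectangle (1, 2, 1, 2), written as (low row, high row, low column, high column). In the setting of the slides, each call to `aggregate` would correspond to one query of the general pattern, which is why overlapping multiple queries matters for the implementation.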
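The three PCA steps on the Principal Component Analysis slide (mean-center, covariance matrix, eigenvectors and eigenvalues) can be written out directly for the two-attribute case, for example latitude and longitude. The sketch below is an assumption-laden illustration: it handles only two dimensions, uses the closed-form eigendecomposition of a symmetric 2x2 matrix instead of a numerical library, and the function name `pca_2d` and the sample points are hypothetical.

```python
import math

def pca_2d(points):
    """Return [(eigenvalue, unit eigenvector), ...] for 2-D points,
    ordered from largest to smallest variance."""
    n = len(points)

    # Step 1: mean-center the data along each dimension.
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Step 2: sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)

    # Step 3: eigenvalues and eigenvectors of the symmetric 2x2 matrix.
    mid = (a + c) / 2
    spread = math.hypot((a - c) / 2, b)
    eigenvalues = (mid + spread, mid - spread)
    if b == 0:
        # Attributes are uncorrelated: the original axes are the basis.
        if a >= c:
            vectors = [(1.0, 0.0), (0.0, 1.0)]
        else:
            vectors = [(0.0, 1.0), (1.0, 0.0)]
    else:
        vectors = []
        for lam in eigenvalues:
            vx, vy = b, lam - a          # solves (a - lam) * vx + b * vy = 0
            norm = math.hypot(vx, vy)
            vectors.append((vx / norm, vy / norm))
    return list(zip(eigenvalues, vectors))

# Made-up points lying exactly on the diagonal y = x.
components = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
(first_value, first_vector), (second_value, second_vector) = components
```

For points on the diagonal, the first eigenvector points along the diagonal and the second eigenvalue is zero. A non-axis-aligned region can then be extended along `first_vector` and `second_vector` instead of along the original axes, which is how a predicate such as `10 < 0.65*Latitude-0.35*Longitude < 100` in the example query would arise.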