Mining for Queries
Florin Rusu
Mentors: Vijayshankar Raman, Lin Qiao, Peter Haas
Manager: Guy Lohman
Motivation
Hardware trends
 Multi-core processors
 Memory capacity
Query execution improvement [RSQ08]
 Parallel, in-memory databases
 Compression, scans
Goal
 Use the query power
Exploratory Analysis
Hypothesis-driven
 Online aggregation [HHW97]
 Data cube exploration
Discovery-driven
 Materialized data cube [SAM98]
Goals
 Automatic exploratory analysis
Cell Phone Call Data Warehouse
Table Calls
Latitude  Longitude  Call Time  Call Duration  Call Type      Call Status
38N       123W       9:20       10:03          International  OK
38N       122W       20:48      20:20          Long           Drop
38N       120W       16:22      0:48           Local          OK
…         …          …          …              …              …
38N       70W        12:29      0:32           Cell           Drop
Interesting Query (1)
What are the time intervals for which the duration of long-distance calls is significantly high?
[Figure: bar charts, one panel per time interval (0:00-5:59, 6:00-11:59, 12:00-17:59, 16:00-21:59, 18:00-23:59), with one bar per call type (Local, Long, International, Cell)]
Interesting Query (2)
What are the areas with large fractions of dropped calls?
Non-axes aligned hyper-rectangles
Query Pattern
Example
SELECT SUM(Call Duration)
FROM Calls
WHERE (16:00 < Call Time < 22:00) AND
(10 < 0.65*Latitude-0.35*Longitude < 100)
GROUP BY Call Type
General pattern
SELECT AGG(G)
FROM T
WHERE L1 < P1 < U1 AND … AND Ln < Pn < Un
GROUP BY G
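As a rough illustration of the general pattern, the sketch below builds such a query from a list of (expression, lower bound, upper bound) predicates and runs it with Python's built-in sqlite3 module. The table layout, the column names (call_hour, duration_min, and so on), the numeric encoding of time and of west longitudes as negative values, and the helper build_query are all assumptions made for this example; they are not taken from the slides.

```python
import sqlite3


def build_query(agg, measure, table, group_col, predicates):
    """Return (sql, params) for: SELECT AGG(measure) ... GROUP BY group_col.

    predicates is a list of (expression, lower, upper) triples; each one
    becomes the range condition lower < expression < upper.
    Expressions and names are interpolated as-is, so they must be trusted.
    """
    clauses, params = [], []
    for expr, lower, upper in predicates:
        clauses.append(f"({expr} > ? AND {expr} < ?)")
        params.extend([lower, upper])
    sql = (
        f"SELECT {group_col}, {agg}({measure}) FROM {table} "
        f"WHERE {' AND '.join(clauses)} GROUP BY {group_col}"
    )
    return sql, params


# Toy data with hypothetical numeric encodings (hours, minutes, degrees).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (latitude REAL, longitude REAL, call_hour REAL, "
    "duration_min REAL, call_type TEXT)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?, ?)",
    [
        (38.0, -123.0, 9.3, 10.0, "International"),
        (38.0, -122.0, 20.8, 20.0, "Long"),
        (38.0, -120.0, 16.4, 1.0, "Local"),
        (38.0, -70.0, 12.5, 0.5, "Cell"),
        (38.0, -121.0, 17.0, 5.0, "Long"),
    ],
)

sql, params = build_query(
    "SUM", "duration_min", "calls", "call_type",
    [
        ("call_hour", 16.0, 22.0),
        ("0.65*latitude - 0.35*longitude", 10.0, 100.0),
    ],
)
result = dict(conn.execute(sql, params).fetchall())
```

Each predicate Pi may be a plain column or a linear combination of columns, which is what allows regions that are not aligned with the axes.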
Problem
Given a data warehouse, find the most interesting regions in the attribute space (P1,…,Pn) according to a function F, subject to sufficient support
Searching problem over hyper-rectangular regions in the attribute space
 argmax_{P1,…,Pn} F(P1,…,Pn,G)
 subject to COUNT(P1,…,Pn) > S
Function F
 Value of one group / Average value of the other groups
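One possible reading of the function F and of the support constraint is sketched below. It assumes each candidate region has already been summarized as a mapping from group to aggregate value plus a row count; the names interestingness and best_region and the sample numbers are hypothetical.

```python
def interestingness(group_values, target):
    """F = value of the target group / average value of the other groups."""
    others = [v for g, v in group_values.items() if g != target]
    if not others:
        raise ValueError("need at least one other group to compare against")
    mean_others = sum(others) / len(others)
    if mean_others == 0:
        return float("inf")
    return group_values[target] / mean_others


def best_region(candidates, target, min_support):
    """Pick the candidate maximizing F among those with COUNT > min_support.

    candidates is a list of (label, group_values, row_count) triples.
    Returns None when no candidate has enough support.
    """
    feasible = [c for c in candidates if c[2] > min_support]
    return max(
        feasible,
        key=lambda c: interestingness(c[1], target),
        default=None,
    )


# Made-up aggregates for two candidate regions.
evening = {"Local": 100.0, "Long": 400.0, "International": 50.0, "Cell": 150.0}
morning = {"Local": 100.0, "Long": 900.0, "International": 100.0, "Cell": 100.0}
candidates = [("evening", evening, 5000), ("morning", morning, 12)]

score = interestingness(evening, "Long")
winner = best_region(candidates, "Long", min_support=100)
```

The morning region has the higher ratio but too few rows, so the support threshold S rules it out.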
Exhaustive Search
Try all axes-aligned hyper-rectangles
 10 attributes with 10 values in the domain give 10^20 hyper-rectangles
Alternatives
 Run independent queries
 Pre-computed prefix-sum array [HAMS97]
Top-down search
 Iceberg pruning on support [BR99]
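To make the prefix-sum alternative concrete, here is a minimal two-dimensional sketch of the general idea: after one pass over a grid of pre-aggregated cell values, the sum over any axes-aligned rectangle takes four lookups. It illustrates the technique in general rather than the specific structure of [HAMS97]; the function names and the sample grid are assumptions.

```python
def build_prefix_sums(grid):
    """prefix[i][j] holds the sum of grid[0..i-1][0..j-1]."""
    rows, cols = len(grid), len(grid[0])
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            prefix[i + 1][j + 1] = (
                grid[i][j] + prefix[i][j + 1] + prefix[i + 1][j] - prefix[i][j]
            )
    return prefix


def range_sum(prefix, r1, c1, r2, c2):
    """Sum of grid[r1..r2][c1..c2], both bounds inclusive."""
    return (
        prefix[r2 + 1][c2 + 1]
        - prefix[r1][c2 + 1]
        - prefix[r2 + 1][c1]
        + prefix[r1][c1]
    )


grid = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
prefix = build_prefix_sums(grid)
lower_right = range_sum(prefix, 1, 1, 2, 2)  # cells 5, 6, 8, 9
everything = range_sum(prefix, 0, 0, 2, 2)
```

The lookup cost is constant per rectangle, but the number of rectangles to test is unchanged, so this only makes each step of an exhaustive search cheaper.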
Incremental Approach
Solve for each group separately and combine the results
F is assumed locally smooth
Bottom-Up search
 Find local maxima points for F
 Extend region around local maxima
 Verify function F
 Verify support
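The first bottom-up step, finding local maxima of F, could look like the sketch below. It assumes F has already been evaluated on a discretized two-dimensional grid of cells and treats a cell as a local maximum when it is strictly greater than all of its neighbours (up to eight). Both the discretization and the strictness rule are assumptions made for illustration.

```python
def local_maxima(f_grid):
    """Return (row, col) of every cell strictly greater than all neighbours."""
    rows, cols = len(f_grid), len(f_grid[0])
    maxima = []
    for i in range(rows):
        for j in range(cols):
            neighbours = [
                f_grid[a][b]
                for a in range(max(0, i - 1), min(rows, i + 2))
                for b in range(max(0, j - 1), min(cols, j + 2))
                if (a, b) != (i, j)
            ]
            if all(f_grid[i][j] > v for v in neighbours):
                maxima.append((i, j))
    return maxima


# Made-up F values with two peaks.
f_grid = [
    [1, 2, 1, 0],
    [2, 5, 2, 1],
    [1, 2, 1, 3],
    [0, 1, 3, 7],
]
peaks = local_maxima(f_grid)
```

Because F is assumed locally smooth, each peak is a plausible seed for a region that is then grown and re-verified.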
Axes-Aligned
Find local maxima points for F
For each local maximum
Extend along axes
Maximal region
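A simplified sketch of the axes-aligned expansion follows. Starting from one local maximum, it grows the rectangle one row or column at a time while every newly added cell keeps F at or above a threshold, and stops when no side can grow; support is then checked on the resulting region. The per-cell threshold test is a stand-in assumed here for "verify function F" (the slides do not specify the exact test), and expand_region, region_support, and the grids are hypothetical.

```python
def expand_region(f_grid, start, f_min):
    """Grow an axes-aligned rectangle around start; returns (r1, c1, r2, c2)."""
    rows, cols = len(f_grid), len(f_grid[0])
    r1 = r2 = start[0]
    c1 = c2 = start[1]
    grew = True
    while grew:
        grew = False
        # Try each side in turn; accept only if the whole new strip qualifies.
        if r1 > 0 and all(f_grid[r1 - 1][c] >= f_min for c in range(c1, c2 + 1)):
            r1 -= 1
            grew = True
        if r2 < rows - 1 and all(f_grid[r2 + 1][c] >= f_min for c in range(c1, c2 + 1)):
            r2 += 1
            grew = True
        if c1 > 0 and all(f_grid[r][c1 - 1] >= f_min for r in range(r1, r2 + 1)):
            c1 -= 1
            grew = True
        if c2 < cols - 1 and all(f_grid[r][c2 + 1] >= f_min for r in range(r1, r2 + 1)):
            c2 += 1
            grew = True
    return r1, c1, r2, c2


def region_support(count_grid, region):
    """Number of rows falling inside the region (sum of per-cell counts)."""
    r1, c1, r2, c2 = region
    return sum(
        count_grid[r][c] for r in range(r1, r2 + 1) for c in range(c1, c2 + 1)
    )


f_grid = [
    [1, 1, 1, 1],
    [1, 4, 5, 1],
    [1, 4, 6, 1],
    [1, 1, 1, 1],
]
count_grid = [[10] * 4 for _ in range(4)]

region = expand_region(f_grid, start=(2, 2), f_min=4)
enough_support = region_support(count_grid, region) > 30
```

The greedy order of the four sides can affect which maximal rectangle is found, so this is one heuristic among several.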
Principal Component Analysis
Find a new basis whose vectors are the directions of greatest variance in the original data
Steps
 Mean-center data along each dimension
 Compute covariance matrix
 Find eigen-vectors and eigen-values
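The three steps can be sketched for the two-dimensional case using only the standard library, since a symmetric 2x2 covariance matrix has a closed-form eigendecomposition. The function name pca_2d and the sample points are assumptions; for more dimensions a numerical eigen-solver would be used instead.

```python
import math


def pca_2d(points):
    """Return (mean, eigenvalues, eigenvectors) for a list of (x, y) points."""
    n = len(points)
    # Step 1: mean-center the data along each dimension.
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Step 2: sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Step 3: closed-form eigenvalues and eigenvectors of the 2x2 matrix.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    eigenvalues = (half_trace + radius, half_trace - radius)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the major axis
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))
    return (mean_x, mean_y), eigenvalues, (v1, v2)


# Points lying exactly on the line y = x.
mean, eigenvalues, (v1, v2) = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
```

For these points all the variance lies along the diagonal, so the first eigenvector points at 45 degrees and the second eigenvalue is zero.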

Non-Axes-Aligned
Find local maxima points for F
For each local maximum
Extend along eigen-vectors computed by PCA
Maximal region
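A region extended along eigenvectors is a rectangle in the rotated coordinate system, so membership reduces to range checks on projections. Each projection is a linear combination of the original attributes, the same form as the predicate 0.65*Latitude-0.35*Longitude in the example query. The sketch below assumes unit-length axes and an origin at the local maximum; the names project and in_rotated_region are hypothetical.

```python
import math


def project(point, origin, axis):
    """Coordinate of point along axis, measured from origin."""
    return sum((p - o) * a for p, o, a in zip(point, origin, axis))


def in_rotated_region(point, origin, axes, bounds):
    """True if lo < projection < hi holds along every axis."""
    return all(
        lo < project(point, origin, axis) < hi
        for axis, (lo, hi) in zip(axes, bounds)
    )


s = math.sqrt(0.5)
axes = [(s, s), (-s, s)]            # basis rotated by 45 degrees
bounds = [(-2.0, 2.0), (-0.5, 0.5)]  # long along the first axis, thin along the second
origin = (0.0, 0.0)

inside = in_rotated_region((1.0, 1.0), origin, axes, bounds)
outside = in_rotated_region((1.0, -1.0), origin, axes, bounds)
```

Expanding such a region amounts to widening one pair of bounds at a time and re-issuing the query with the updated range predicates.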
Query Parameters
Aggregate fraction
 AGG(g) / ΣAGG(G)
Support
 COUNT(g)
Density
 COUNT DISTINCT(g) / TOTAL POINTS
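The three parameters can be computed for one region as in the sketch below. It takes AGG to be SUM, and it reads density as the number of distinct attribute points occupied by group g divided by a caller-supplied total number of points. Both choices are assumptions, since the slides do not spell them out, and region_parameters and the sample rows are hypothetical.

```python
def region_parameters(rows, target, total_points):
    """Return (aggregate fraction, support, density) for one group.

    rows are (point, group, measure) tuples already restricted to a region.
    """
    totals = {}
    for _, group, measure in rows:
        totals[group] = totals.get(group, 0.0) + measure
    target_rows = [r for r in rows if r[1] == target]
    # Aggregate fraction: AGG(g) / sum of AGG over all groups (AGG = SUM here).
    fraction = totals.get(target, 0.0) / sum(totals.values())
    # Support: number of rows of the target group in the region.
    support = len(target_rows)
    # Density: distinct occupied points of the target group / total points.
    density = len({r[0] for r in target_rows}) / total_points
    return fraction, support, density


rows = [
    ((1, 1), "Long", 30.0),
    ((1, 1), "Long", 10.0),
    ((1, 2), "Long", 20.0),
    ((2, 2), "Local", 40.0),
]
fraction, support, density = region_parameters(rows, "Long", total_points=4)
```

A region with an empty or all-zero aggregate would need a guard against division by zero, which is omitted here for brevity.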
Implementation
Use Blink [RSQ08] as query engine
Use queries extensively
 Function evaluation
 Region expansion
 Overlap multiple queries

Questions
Comments
Suggestions