Download datacubes

Document related concepts
no text concepts found
Transcript
15-721
Database Management Systems
Data Mining - Data Cubes
Anastassia Ailamaki
http://www.cs.cmu.edu/~natassa
Substitute lecturer
Christos Faloutsos
[email protected]
www.cs.cmu.edu/~christos
15-721
Copyright: C. Faloutsos (2001)
2
Outline
• Problem
• Getting the data: Data Warehouses, DataCubes,
OLAP
• Supervised learning: decision trees
• Unsupervised learning
– association rules
– (clustering)
15-721
Copyright: C. Faloutsos (2001)
3
Problem
Given: multiple data sources
Find: patterns (classifiers, rules, clusters, outliers...)
PGH
NY
sales(p-id, c-id, date, $price)
???
customers( c-id, age, income, ...)
SF
15-721
Copyright: C. Faloutsos (2001)
4
Data Ware-housing
First step: collect the data, in a single place (=
Data Warehouse)
How?
How often?
How about discrepancies / nonhomegeneities?
15-721
Copyright: C. Faloutsos (2001)
5
Data Ware-housing
First step: collect the data, in a single place (=
Data Warehouse)
How?
A: Triggers/Materialized views
How often? A: [Art!]
How about discrepancies / nonhomegeneities?
A: Wrappers/Mediators
15-721
Copyright: C. Faloutsos (2001)
6
Data Ware-housing
Step 2: collect counts. (DataCubes/OLAP)
Eg.:
15-721
Copyright: C. Faloutsos (2001)
7
OLAP
Problem: “is it true that shirts in large sizes sell
better in dark colors?”
sales
C/S S
M
L
TOT
Red 20
3
5
28
ci-d
p-id
Size
Color $
C10
Shirt
L
Blue
30
Blue 3
3
8
14
C10
Pants XL
Red
50
Gray 0
0
5
5
C20
Shirt
White 20
TOT 23
6
18
47
XL
...
15-721
Copyright: C. Faloutsos (2001)
8
DataCubes
‘color’, ‘size’: DIMENSIONS
‘count’: MEASURE
f
size
color; size
15-721
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
9
DataCubes
‘color’, ‘size’: DIMENSIONS
‘count’: MEASURE
f
size
color; size
15-721
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
10
DataCubes
‘color’, ‘size’: DIMENSIONS
‘count’: MEASURE
f
size
color; size
15-721
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
11
DataCubes
‘color’, ‘size’: DIMENSIONS
‘count’: MEASURE
f
size
color; size
15-721
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
12
DataCubes
‘color’, ‘size’: DIMENSIONS
‘count’: MEASURE
f
size
color; size
15-721
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
13
DataCubes
‘color’, ‘size’: DIMENSIONS
‘count’: MEASURE
f
size
color; size
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
DataCube
15-721
Copyright: C. Faloutsos (2001)
14
DataCubes
SQL query to generate DataCube:
• Naively (and painfully:)
select size, color, count(*)
from sales where p-id = ‘shirt’
group by size, color
select size, count(*)
from sales where p-id = ‘shirt’
group by size
...
15-721
Copyright: C. Faloutsos (2001)
15
DataCubes
SQL query to generate DataCube:
• with ‘cube by’ keyword:
select size, color, count(*)
from sales
where p-id = ‘shirt’
cube by size, color
15-721
Copyright: C. Faloutsos (2001)
16
DataCubes
DataCube issues:
Q1: How to store them (and/or materialize
portions on demand)
Q2: How to index them
Q3: Which operations to allow
15-721
Copyright: C. Faloutsos (2001)
17
DataCubes
DataCube issues:
Q1: How to store them (and/or materialize
portions on demand) A: ROLAP/MOLAP
Q2: How to index them A: bitmap indices
Q3: Which operations to allow A: roll-up, drill
down, slice, dice
[More details: book by Han+Kamber]
15-721
Copyright: C. Faloutsos (2001)
18
DataCubes
Q1: How to store a dataCube?
15-721
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
19
DataCubes
Q1: How to store a dataCube?
A1: Relational (R-OLAP)
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Blue 'all' 14
Gray 0
0
5
5
Blue M
TOT 23
6
18
47
Color Size count
'all'
'all' 47
3
…
15-721
Copyright: C. Faloutsos (2001)
20
DataCubes
Q1: How to store a dataCube?
A2: Multi-dimensional (M-OLAP)
C/S S
M
A3: Hybrid (H-OLAP)
15-721
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
21
DataCubes
Pros/Cons:
ROLAP strong points: (DSS, Metacube)
15-721
Copyright: C. Faloutsos (2001)
22
DataCubes
Pros/Cons:
ROLAP strong points: (DSS, Metacube)
• use existing RDBMS technology
• scale up better with dimensionality
15-721
Copyright: C. Faloutsos (2001)
23
DataCubes
Pros/Cons:
MOLAP strong points: (EssBase/hyperion.com)
• faster indexing
(careful with: high-dimensionality; sparseness)
HOLAP: (MS SQL server OLAP services)
• detail data in ROLAP; summaries in MOLAP
15-721
Copyright: C. Faloutsos (2001)
24
DataCubes
Q1: How to store a dataCube
Q2: What operations should we support?
Q3: How to index a dataCube?
15-721
Copyright: C. Faloutsos (2001)
25
DataCubes
Q2: What operations should we support?
f
size
color; size
15-721
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
26
DataCubes
Q2: What operations should we support?
Roll-up
f
size
color; size
15-721
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
27
DataCubes
Q2: What operations should we support?
Drill-down
f
size
color; size
15-721
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
28
DataCubes
Q2: What operations should we support?
Slice
f
size
color; size
15-721
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
29
DataCubes
Q2: What operations should we support?
Dice
f
size
color; size
15-721
color
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
30
DataCubes
Q2: What operations should we support?
• Roll-up
• Drill-down
• Slice
• Dice
15-721
Copyright: C. Faloutsos (2001)
31
DataCubes
Q1: How to store a dataCube
Q2: What operations should we support?
Q3: How to index a dataCube?
15-721
Copyright: C. Faloutsos (2001)
32
DataCubes
Q3: How to index a dataCube?
15-721
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
33
DataCubes
Q3: How to index a dataCube?
A1: Bitmaps
S
1
1
M
…
1
…
15-721
L
…
Red Blue Gray
1
1
1
… … …
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
34
DataCubes
Q3: How to index a dataCube?
A2: Join indices (see [Han+Kamber])
15-721
C/S S
M
L
TOT
Red 20
3
5
28
Blue 3
3
8
14
Gray 0
0
5
5
TOT 23
6
18
47
Copyright: C. Faloutsos (2001)
35
D/W - OLAP - Conclusions
• D/W: copy (summarized) data + analyze
• OLAP - concepts:
– DataCube
– R/M/H-OLAP servers
– ‘dimensions’; ‘measures’
15-721
Copyright: C. Faloutsos (2001)
36
Outline
• Problem
• Getting the data: Data Warehouses, DataCubes,
OLAP
• Supervised learning: decision trees
• Unsupervised learning
– association rules
– (clustering)
15-721
Copyright: C. Faloutsos (2001)
37
Decision trees - Problem
Age Chol-level Gender …
30
150
M
CLASS-ID
+
…
-
??
15-721
Copyright: C. Faloutsos (2001)
38
Decision trees
• Pictorially, we have
num. attr#2
(eg., chol-level)
+
+
+
+
+ +
-
- -
-
+
-
num. attr#1 (eg., ‘age’)
15-721
Copyright: C. Faloutsos (2001)
39
Decision trees
• and we want to label ‘?’
?
num. attr#2
(eg., chol-level)
+
+
+
+
+ +
-
- -
-
+
-
num. attr#1 (eg., ‘age’)
15-721
Copyright: C. Faloutsos (2001)
40
Decision trees
• so we build a decision tree:
?
num. attr#2
(eg., chol-level)
40
+
+
+
+
+ +
-
- -
-
+
-
50
num. attr#1 (eg., ‘age’)
15-721
Copyright: C. Faloutsos (2001)
41
Decision trees
• so we build a decision tree:
age<50
N
Y
+
Y
-
15-721
chol. <40
N
...
Copyright: C. Faloutsos (2001)
42
Outline
• Problem
• Getting the data: Data Warehouses, DataCubes,
OLAP
• Supervised learning: decision trees
– problem
– approach
– scalability enhancements
• Unsupervised learning
– association rules
– (clustering)
15-721
Copyright: C. Faloutsos (2001)
43
Decision trees
• Typically, two steps:
– tree building
– tree pruning (for over-training/over-fitting)
15-721
Copyright: C. Faloutsos (2001)
44
Tree building
• How?
num. attr#2
(eg., chol-level)
+ -+ +
+
+
++
num. attr#1 (eg., ‘age’)
15-721
Copyright: C. Faloutsos (2001)
45
Tree building
• How?
• A: Partition, recursively - pseudocode:
Partition ( Dataset S)
if all points in S have same label
then return
evaluate splits along each attribute A
pick best split, to divide S into S1 and S2
Partition(S1); Partition(S2)
15-721
Copyright: C. Faloutsos (2001)
46
Tree building
• Q1: how to introduce splits along attribute
Ai
• Q2: how to evaluate a split?
15-721
Copyright: C. Faloutsos (2001)
47
Tree building
• Q1: how to introduce splits along attribute Ai
• A1:
– for num. attributes:
• binary split, or
• multiple split
– for categorical attributes:
• compute all subsets (expensive!), or
• use a greedy algo
15-721
Copyright: C. Faloutsos (2001)
48
Tree building
• Q1: how to introduce splits along attribute
Ai
• Q2: how to evaluate a split?
15-721
Copyright: C. Faloutsos (2001)
49
Tree building
• Q1: how to introduce splits along attribute
Ai
• Q2: how to evaluate a split?
• A: by how close to uniform each subset is ie., we need a measure of uniformity:
15-721
Copyright: C. Faloutsos (2001)
50
Tree building
entropy: H(p+, p-)
Any other measure?
1
0
0
15-721
0.5
1 p+
Copyright: C. Faloutsos (2001)
51
Tree building
‘gini’ index: 1-p+2 - p-2
entropy: H(p+, p-)
1
1
0
0
0
15-721
0.5
1 p+
0
Copyright: C. Faloutsos (2001)
0.5
1 p+
52
Tree building
entropy: H(p+, p-)
‘gini’ index: 1-p+2 - p-2
(How about multiple labels?)
15-721
Copyright: C. Faloutsos (2001)
53
Tree building
Intuition:
• entropy: #bits to encode the class label
• gini: classification error, if we randomly
guess ‘+’ with prob. p+
15-721
Copyright: C. Faloutsos (2001)
54
Tree building
Thus, we choose the split that reduces
entropy/classification-error the most: Eg.:
num. attr#2
(eg., chol-level)
+ -+ +
++ + - - +
num. attr#1 (eg., ‘age’)
15-721
Copyright: C. Faloutsos (2001)
55
Tree building
• Before split: we need
(n+ + n-) * H( p+, p-) = (7+6) * H(7/13, 6/13)
bits total, to encode all the class labels
• After the split we need:
0 bits
for the first half and
(2+6) * H(2/8, 6/8) bits for the second half
15-721
Copyright: C. Faloutsos (2001)
56
Tree pruning
• What for?
num. attr#2
(eg., chol-level)
+ -+ +
++ + - - +
...
num. attr#1 (eg., ‘age’)
15-721
Copyright: C. Faloutsos (2001)
57
Tree pruning
Shortcut for scalability: DYNAMIC pruning:
• stop expanding the tree, if a node is
‘reasonably’ homogeneous
– ad hoc threshold [Agrawal+, vldb92]
– Minimum Description Language (MDL)
criterion (SLIQ) [Mehta+, edbt96]
15-721
Copyright: C. Faloutsos (2001)
58
Tree pruning
• Q: How to do it?
• A1: use a ‘training’ and a ‘testing’ set prune nodes that improve classification in
the ‘testing’ set. (Drawbacks?)
• A2: or, rely on MDL (= Minimum
Description Language) - in detail:
15-721
Copyright: C. Faloutsos (2001)
59
Tree pruning
• envision the problem as compression (of
what?)
15-721
Copyright: C. Faloutsos (2001)
60
Tree pruning
• envision the problem as compression (of
what?)
• and try to min. the # bits to compress
(a) the class labels AND
(b) the representation of the decision tree
15-721
Copyright: C. Faloutsos (2001)
61
(MDL)
• a brilliant idea - eg.: best n-degree
polynomial to compress these points:
• the one that minimizes (sum of errors + n )
15-721
Copyright: C. Faloutsos (2001)
62
Outline
• Problem
• Getting the data: Data Warehouses, DataCubes,
OLAP
• Supervised learning: decision trees
– problem
– approach
– scalability enhancements
• Unsupervised learning
– association rules
– (clustering)
15-721
Copyright: C. Faloutsos (2001)
63
Scalability enhancements
• Interval Classifier [Agrawal+,vldb92]:
dynamic pruning
• SLIQ: dynamic pruning with MDL; vertical
partitioning of the file (but label column has
to fit in core)
• SPRINT: even more clever partitioning
15-721
Copyright: C. Faloutsos (2001)
64
Conclusions for classifiers
•
•
•
•
Classification through trees
Building phase - splitting policies
Pruning phase (to avoid over-fitting)
For scalability:
– dynamic pruning
– clever data partitioning
15-721
Copyright: C. Faloutsos (2001)
65
Overall Conclusions
• Data Mining: of high commercial interest
• DM = DB+ ML+ Stat
•
•
•
•
Data warehousing / OLAP: to get the data
Tree classifiers (SLIQ, SPRINT)
Association Rules - ‘a-priori’ algorithm
(clustering: BIRCH, CURE, OPTICS)
15-721
Copyright: C. Faloutsos (2001)
66
Reading material
• Agrawal, R., T. Imielinski, A. Swami, ‘Mining
Association Rules between Sets of Items in Large
Databases’, SIGMOD 1993.
• M. Mehta, R. Agrawal and J. Rissanen, `SLIQ: A Fast
Scalable Classifier for Data Mining', Proc. of the Fifth Int'l
Conference on Extending Database Technology (EDBT),
Avignon, France, March 1996
15-721
Copyright: C. Faloutsos (2001)
67
Additional references
• Agrawal, R., S. Ghosh, et al. (Aug. 23-27, 1992). An
Interval Classifier for Database Mining Applications.
VLDB Conf. Proc., Vancouver, BC, Canada.
• Jiawei Han and Micheline Kamber, Data Mining , Morgan
Kaufman, 2001, chapters 2.2-2.3, 6.1-6.2, 7.3.5
15-721
Copyright: C. Faloutsos (2001)
68