CSG230 Summary
Donghui Zhang
Friday, May 5, 2017
Data Mining: Concepts and Techniques
What we learned?
1. Frequent pattern & association
     frequent itemsets (Apriori, FP-growth)
     max and closed itemsets
     association rules
     essential rules
     generalized itemsets
     sequential pattern
2. Clustering
     k-means
     BIRCH (based on CF-tree)
     DBSCAN
     CURE
3. Classification
     decision tree
     naïve Bayesian classifier
     Bayesian network
     neural net and SVM
4. Data warehousing
     concept, schema
     data cube & operations (rollup, …)
     cube computation: multi-way array aggregation
     iceberg cube
     dynamic data cube
5. Additional
     lattice (of itemsets, g-itemsets, rules, cuboids)
     distance-based indexing
1. Frequent pattern & association
     frequent itemsets (Apriori, FP-growth)
     max and closed itemsets
     association rules
     essential rules
     generalized itemsets
     sequential pattern
Basic Concepts: Frequent Patterns and Association Rules

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Itemset X = {x1, …, xk}
Find all the rules X → Y with minimum confidence and support:
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
Let min_support = 50%, min_conf = 50%:
  A → C (50%, 66.7%)
  C → A (50%, 100%)
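The two measures can be checked on the four-transaction table with a short Python sketch. The names `TDB`, `support` and `confidence` are illustrative choices, not part of the course material.

```python
# Toy transaction database from the slide (Transaction-id -> items bought).
TDB = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db=TDB):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(lhs, rhs, db=TDB):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print("A -> C:", support("AC"), confidence("A", "C"))
print("C -> A:", support("AC"), confidence("C", "A"))
```

Both rules share the support of {A, C}; only the denominator of the confidence differs.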
From Mining Association Rules to Mining Frequent Patterns (i.e. Frequent Itemsets)
Given a frequent itemset X, how to find association rules?
  Examine every non-empty proper subset S of X.
  Confidence(S → X – S) = support(X) / support(S)
  Compare with min_conf
An optimization is possible (refer to exercises 6.1, 6.2).
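A minimal sketch of the subset enumeration, without the optimization mentioned in the exercises. The function name `rules_from_itemset` and the lookup table `sup` are assumptions made for illustration; the counts come from the four-transaction example.

```python
from itertools import combinations

def rules_from_itemset(X, sup, min_conf):
    """Yield (S, X - S, confidence) for every non-empty proper subset S of X."""
    X = frozenset(X)
    for r in range(1, len(X)):
        for S in map(frozenset, combinations(sorted(X), r)):
            conf = sup[X] / sup[S]          # support(X) / support(S)
            if conf >= min_conf:
                yield S, X - S, conf

# Support counts of {A}, {C} and {A, C} in the four-transaction example.
sup = {frozenset("A"): 3, frozenset("C"): 2, frozenset("AC"): 2}
rules = list(rules_from_itemset("AC", sup, min_conf=0.5))
for S, rest, conf in rules:
    print(sorted(S), "->", sorted(rest), round(conf, 3))
```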
The Apriori Algorithm: An Example

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan gives C1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan gives the counts for C2:
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2:
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3: {B, C, E}

3rd scan gives L3:
Itemset     sup
{B, C, E}   2
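The level-wise scans can be reproduced with a compact sketch. It assumes a minimum support count of 2 (the value implied by dropping {D}), and it uses a naive union-based join rather than the prefix-based self-join, so it is an illustration rather than an efficient implementation.

```python
from itertools import combinations

TDB = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}

def apriori(db, min_sup):
    """Return {itemset: support count} for all frequent itemsets."""
    def count(cands):
        # one scan of the database per candidate set
        return {c: sum(c <= t for t in db.values()) for c in cands}

    items = {frozenset([i]) for t in db.values() for i in t}
    Lk = {c: n for c, n in count(items).items() if n >= min_sup}
    frequent, k = dict(Lk), 1
    while Lk:
        k += 1
        # join: unions of two frequent (k-1)-itemsets that have exactly k items
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # prune: every (k-1)-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c: n for c, n in count(cands).items() if n >= min_sup}
        frequent.update(Lk)
    return frequent

frequent = apriori(TDB, min_sup=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```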
Important Details of Apriori
How to generate candidates?
  Step 1: self-joining Lk
  Step 2: pruning
How to count supports of candidates?
Example of candidate generation
  L3 = {abc, abd, acd, ace, bcd}
  Self-joining: L3*L3
    abcd from abc and abd
    acde from acd and ace
  Pruning:
    acde is removed because ade is not in L3
  C4 = {abcd}
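The two steps can be sketched as follows, with itemsets written as sorted strings as on the slide. The function name `apriori_gen` is an illustrative choice.

```python
from itertools import combinations

def apriori_gen(Lk):
    """Self-join then prune; itemsets are sorted strings such as 'abc'."""
    Lk = set(Lk)
    k = len(next(iter(Lk)))
    # Step 1: join two itemsets that share their first k-1 items
    joined = {p + q[-1] for p in Lk for q in Lk
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2: drop candidates that have a k-subset outside Lk
    return {c for c in joined
            if all("".join(s) in Lk for s in combinations(c, k))}

L3 = {"abc", "abd", "acd", "ace", "bcd"}
C4 = apriori_gen(L3)
print(C4)
```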
Construct FP-tree from a Transaction Database
min_support = 3

TID   Items bought                 (ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o, w}           {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order: the f-list
3. Scan DB again, construct FP-tree

Header Table
Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3
F-list = f-c-a-b-m-p

FP-tree after inserting each transaction:
  After TID 100: {} - f:1 - c:1 - a:1 - m:1 - p:1
  After TID 200: f:2 - c:2 - a:2, which now has two branches: m:1 - p:1 and b:1 - m:1
  After TID 300: f:3, with a new branch f - b:1
  After TID 400: a new branch from the root: {} - c:1 - b:1 - p:1
  After TID 500: f:4 - c:3 - a:3 - m:2 - p:2

Final FP-tree (children indented under their parent):
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
Find Patterns Having P From P-conditional Database
Starting at the frequent item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p’s conditional pattern base

Conditional pattern bases
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
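A small sketch of the construction and of the prefix-path traversal. The `Node` class, the header-table layout (a list of nodes per item instead of linked node-links) and the hard-coded `F_LIST` are simplifying assumptions; the f-list order is copied from the slide instead of being recomputed, because several items tie in frequency.

```python
from collections import defaultdict

TRANSACTIONS = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
F_LIST = "fcabmp"   # frequency-descending order, taken from the slide

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(transactions, f_list):
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        node = root
        # keep only frequent items, visited in f-list order
        for item in (i for i in f_list if i in t):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(item, header):
    """Map each prefix path of `item` to the count of the `item` node ending it."""
    base = {}
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base["".join(reversed(path))] = node.count
    return base

root, header = build_fp_tree(TRANSACTIONS, F_LIST)
for item in F_LIST[1:]:
    print(item, conditional_pattern_base(item, header))
```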
Max-patterns
Frequent pattern {a1, …, a100} implies $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ frequent subpatterns!
Max-pattern: a frequent pattern without a proper frequent super-pattern
  BCDE, ACD are max-patterns
  BCD is not a max-pattern

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
Min_sup = 2
Example

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
Min_sup = 2

Set-enumeration tree: root (ABCDEF) with child nodes A (BCDE), B (CDE), C (DE), D (E), E ()

Root:
  Items    Frequency
  A        2
  B        2
  C        3
  D        3
  E        2
  F        1
  ABCDEF   0
  Max patterns: none yet

Node A:
  Items   Frequency
  AB      1
  AC      2
  AD      2
  AE      1
  ACD     2
  Max patterns: ACD

Node B:
  Items   Frequency
  BCDE    2
  Max patterns: ACD, BCDE

Result: the max patterns are ACD and BCDE.
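The result can be cross-checked by brute force. This sketch enumerates every itemset, so it only suits tiny examples and is not the tree-based search shown on the slides; the function names are illustrative.

```python
from itertools import combinations

TDB = [set("ABCDE"), set("BCDE"), set("ACDF")]

def frequent_itemsets(db, min_sup):
    """All itemsets contained in at least `min_sup` transactions (exhaustive search)."""
    items = sorted(set().union(*db))
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if sum(set(c) <= t for t in db) >= min_sup}

def max_patterns(db, min_sup):
    """Frequent itemsets that have no proper frequent superset."""
    freq = frequent_itemsets(db, min_sup)
    return {x for x in freq if not any(x < y for y in freq)}

result = sorted("".join(sorted(p)) for p in max_patterns(TDB, 2))
print(result)
```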
A Critical Observation

Rule      Support     Confidence
A → BC    sup(ABC)    sup(ABC)/sup(A)
AB → C    sup(ABC)    sup(ABC)/sup(AB)
AC → B    sup(ABC)    sup(ABC)/sup(AC)
A → B     sup(AB)     sup(AB)/sup(A)
A → C     sup(AC)     sup(AC)/sup(A)

A → BC never has larger support or confidence than the other rules, regardless of the TDB.
Rules AB → C, AC → B, A → B and A → C are redundant with regard to A → BC.
While mining association rules, a large percentage of rules may be redundant.
Formal Definition of Essential Rule
Definition 1: Rule r1 implies another rule r2 if support(r1) ≤ support(r2) and confidence(r1) ≤ confidence(r2), regardless of the TDB.
  Denote as r1 ⇒ r2
Definition 2: Rule r1 is an essential rule if r1 is strong and ∄ r2 s.t. r2 ⇒ r1.
Example of a Lattice of rules
Rules over the itemset ABC, by level:
  A → BC, B → AC, C → AB
  AB → C, AC → B, A → B, A → C, BC → A, B → A, B → C, C → A, C → B
• Generate the child nodes: move or delete an item from the consequent.
• To find essential rules: start from each max itemset; browse top-down; prune a sub-tree whenever a rule is confident.
Frequent generalized itemsets
A taxonomy of items.
TDB involves leaf items in the taxonomy.
A g-itemset may contain g-items, but cannot
contain an ancestor and a descendant at the
same time.
!! A descendant g-item is a “superset”!!
Anyone who bought {milk, bread} also
bought {milk}.
Anyone who bought {A} also bought {W}.
?? how to find frequent g-itemsets?
Browse (and prune) a lattice of g-itemsets!
To get children, replace one item by its
ancestor (if conflicts, remove instead.)
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences

A sequence database
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
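A short sketch of how the support of a candidate such as <(ab)c> could be counted. Each sequence is modelled as a list of item sets, and `contains` uses a greedy left-to-right match; both names are illustrative.

```python
SDB = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}

def contains(sequence, pattern):
    """True if every element of `pattern` is a subset of a distinct, later element of `sequence`."""
    pos = 0
    for element in pattern:
        # advance to the first remaining element that covers this pattern element
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, db=SDB):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db.values())

print(support([{"a", "b"}, {"c"}]))
```

Only sequences 10 and 30 contain an element covering (ab) followed later by c, which is why the pattern meets min_sup = 2.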
Mining Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns
  <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of sequential patterns can be partitioned into 6 subsets:
  The ones having prefix <a>;
  The ones having prefix <b>;
  …
  The ones having prefix <f>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
Finding Seq. Patterns with Prefix <a>
Only need to consider projections w.r.t. <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, by checking the frequency of items like a and _a.
Further partition into 6 subsets
  Having prefix <aa>;
  …
  Having prefix <af>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
2. Clustering
     k-means
     BIRCH (based on CF-tree)
     DBSCAN
     CURE
The K-Means Clustering Method
Step 1: Pick k objects as initial seed points
Step 2: Assign each object to the cluster with the nearest seed point
Step 3: Re-compute each seed point as the centroid (or mean point) of its cluster
Step 4: Go back to Step 2; stop when no assignment changes
Not optimal. A counter example?
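The four steps can be sketched in plain Python. The random seeding with a fixed `seed`, the squared Euclidean distance and the handling of an emptied cluster are assumptions of this sketch, not details given on the slide.

```python
import random

def kmeans(points, k, seed=0):
    """Basic k-means on tuples of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # Step 1: pick k objects as seeds
    labels = None
    while True:
        # Step 2: assign each object to the nearest seed point
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:                 # Step 4: stop when nothing changes
            return centroids, labels
        labels = new_labels
        # Step 3: recompute each seed as the centroid of its cluster
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                          # keep the old seed if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The result depends on the initial seeds, which is one way to approach the counter-example question: a poor seeding can converge to a clustering that is not the best one.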
BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Weakness: handles only numeric data, and is sensitive to the order of the data records.
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
  N: number of data points
  LS: $\sum_{i=1}^{N} X_i$
  SS: $\sum_{i=1}^{N} X_i^2$
Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), 244)
[Figure: the five points plotted on a grid with both axes running from 0 to 10]
Some Characteristics of CF
Two CFs can be aggregated.
  Given CF1 = (N1, LS1, SS1), CF2 = (N2, LS2, SS2),
  if combined into one cluster, CF = (N1+N2, LS1+LS2, SS1+SS2).
The centroid and radius can both be computed from CF.
  The centroid $x_0$ is the center of the cluster:
    $x_0 = \frac{\sum_{i=1}^{N} x_i}{N}$
  The radius R is the average (root-mean-square) distance between an object and the centroid:
    $R = \sqrt{\frac{\sum_{i=1}^{N} (x_i - x_0)^2}{N}}$
How?
  $x_0 = \frac{LS}{N}$
  $R = \sqrt{\frac{\sum_{i=1}^{N} (x_i^2 + x_0^2 - 2 x_i \cdot x_0)}{N}} = \sqrt{\frac{SS + N (LS/N)^2 - 2\, LS \cdot (LS/N)}{N}} = \frac{1}{N} \sqrt{N \cdot SS - LS^2}$
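The closed forms can be checked numerically on the five example points. In this sketch SS is the scalar sum of squared norms, matching CF = (5, (16,30), 244); the helper names are illustrative.

```python
import math

def cf(points):
    """Clustering feature (N, LS, SS) with SS as the scalar sum of squared norms."""
    n = len(points)
    ls = tuple(sum(c) for c in zip(*points))
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def merge(cf1, cf2):
    """Aggregate two CFs component-wise."""
    return (cf1[0] + cf2[0],
            tuple(a + b for a, b in zip(cf1[1], cf2[1])),
            cf1[2] + cf2[2])

def centroid(c):
    n, ls, _ = c
    return tuple(x / n for x in ls)          # x0 = LS / N

def radius(c):
    n, ls, ss = c
    return math.sqrt(n * ss - sum(x * x for x in ls)) / n   # (1/N) sqrt(N*SS - LS^2)

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(pts), centroid(cf(pts)), radius(cf(pts)))
```

Because a CF is additive, a cluster's centroid and radius are available without revisiting its points, which is what lets the CF-tree summarize subclusters.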
CF-Tree in BIRCH
Clustering feature:
summary of the statistics for a given subcluster: the 0-th, 1st and
2nd moments of the subcluster from the statistical point of view.
registers crucial measurements for computing cluster and utilizes
storage efficiently
A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
Branching factor: specify the maximum number of children.
threshold T: max radius of sub-clusters stored at the leaf nodes
Insertion in a CF-Tree
To insert an object o into a CF-tree, insert it at the root node of the CF-tree.
To insert o into an index node, insert it into the child node whose centroid is the closest to o.
To insert o into a leaf node:
  If an existing leaf entry can “absorb” it (i.e. new radius <= T), add it to that entry;
  Otherwise, create a new leaf entry.
Split:
  Choose the two entries whose centroids are the farthest apart;
  Assign them to two different groups;
  Assign each remaining entry to one of these groups.
Density-Based Clustering: Background (II)
Density-reachable:
  A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
  [Figure: a chain of points from q through p1 to p]
Density-connected:
  A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
  [Figure: points p and q, both reached from a point o]
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border and outlier points, with Eps = 1cm and MinPts = 5]
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p wrt Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
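The loop above can be sketched as follows. This version uses a brute-force neighbourhood search and counts a point as its own neighbour; both choices, and the label convention (-1 for noise), are assumptions of the sketch.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id 0, 1, ... or -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # not a core point: noise for now
            continue
        cluster += 1                       # p is a core point: a cluster is formed
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # earlier noise turns out to be a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:         # only core points extend the cluster
                queue.extend(nj)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```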
Motivation for CURE
k-means does not perform well on this;
AGNES + dmin has single-link effect!
Cure: The Basic Version
Initially, insert every object into the priority queue PQ as a cluster.
Every cluster in PQ has:
  (Up to) C representative points
  A pointer to its closest cluster (dist between two clusters = min{dist(rep1, rep2)}).
While PQ has more than k clusters,
  Merge the top cluster with its closest cluster.
Representative points
Step 1: choose up to C points.
  If a cluster has no more than C points, choose all of them.
  Otherwise, choose the first point as the farthest from the mean. Choose the others as the farthest from the chosen ones.
Step 2: shrink each point towards the mean:
  p’ = p + α * (mean – p)
  α ∈ [0,1]. Larger α means shrinking more.
Reason for shrinking: avoid outliers, as faraway objects are shrunk more.
3. Classification
     decision tree
     naïve Bayesian classifier
     Bayesian net
     neural net and SVM
Training Dataset
This follows an example from Quinlan’s ID3

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
  Tree is constructed in a top-down recursive divide-and-conquer manner
  At start, all the training examples are at the root
  Attributes are categorical (if continuous-valued, they are discretized in advance)
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
  All samples for a given node belong to the same class
  There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  There are no samples left
General Case
Suppose X can have one of m values V1, V2, …, Vm
  P(X=V1) = p1, P(X=V2) = p2, …, P(X=Vm) = pm
What’s the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution? It’s
  $H(X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_m \log_2 p_m = -\sum_{j=1}^{m} p_j \log_2 p_j$
H(X) = the entropy of X
“High Entropy” means X is from a uniform (boring) distribution
“Low Entropy” means X is from a varied (peaks and valleys) distribution
Specific Conditional Entropy
X = College Major
Y = Likes “Gladiator”

X         Y
Math      Yes
History   No
CS        Yes
Math      No
Math      No
CS        Yes
History   No
Math      Yes

Definition of Conditional Entropy:
  H(Y|X=v) = the entropy of Y among only those records in which X has value v
Example:
  • H(Y|X=Math) = 1
  • H(Y|X=History) = 0
  • H(Y|X=CS) = 0
Conditional Entropy
X = College Major
Y = Likes “Gladiator”
(the same eight records as on the previous slide)

Definition of general Conditional Entropy:
  H(Y|X) = the average conditional entropy of Y = Σj Prob(X=vj) H(Y | X = vj)
Example:

vj        Prob(X=vj)   H(Y | X = vj)
Math      0.5          1
History   0.25         0
CS        0.25         0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
Conditional entropy H(C|age)

age      buy   no
<=30     2     3
30…40    4     0
>40      3     2

H(C|age<=30) = 2/5 * lg(5/2) + 3/5 * lg(5/3) = 0.971
H(C|age in 30..40) = 1 * lg 1 + 0 * lg 1/0 = 0   (the term 0 * lg 1/0 is taken as 0)
H(C|age>40) = 3/5 * lg(5/3) + 2/5 * lg(5/2) = 0.971

$H(C \mid age) = \frac{5}{14} H(C \mid age \le 30) + \frac{4}{14} H(C \mid age \in (30,40]) + \frac{5}{14} H(C \mid age > 40) = 0.694$
Select the attribute with lowest conditional entropy
H(C|age) = 0.694
H(C|income) = 0.911
H(C|student) = 0.789
H(C|credit_rating) = 0.892
Select “age” to be the tree root!
age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
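The attribute comparison can be reproduced from the 14 training records. The tuple layout of `ROWS` and the helper names are assumptions of this sketch; the values are those of the training dataset slide.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ("age", "income", "student", "credit_rating")
ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """H = -sum p_j log2 p_j over the observed class frequencies."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(rows, col):
    """H(C | attribute) = sum_v P(attribute = v) * H(C | attribute = v)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

scores = {name: conditional_entropy(ROWS, i) for i, name in enumerate(COLUMNS)}
best = min(scores, key=scores.get)
print(scores, best)
```

Minimizing H(C | attribute) is the same as maximizing information gain, since H(C) is identical for every attribute.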
Bayesian Classification
X: a data sample whose class label is unknown, e.g. X = (Income=medium, Credit_rating=Fair, Age=40).
Hi: a hypothesis that a record belongs to class Ci, e.g. Hi = a record belongs to the “buy computer” class.
P(Hi), P(X): probabilities.
P(Hi|X): a conditional probability: among all records with medium income and fair credit rating, what’s the probability of buying a computer?
This is what we need for classification! Given X, P(Hi|X) tells us the probability that it belongs to some class.
What if we need to determine a single class for X?
Bayesian Theorem
Another concept, P(X|Hi): the probability of observing the sample X, given that the hypothesis holds. E.g. among all people who buy a computer, what percentage has the same value as X.
We know P(X ∩ Hi) = P(Hi|X) P(X) = P(X|Hi) P(Hi),
so
  $P(H_i \mid X) = \frac{P(X \mid H_i)\, P(H_i)}{P(X)}$
We should assign X to the class Ci where P(Hi|X) is maximized,
  which is equivalent to maximizing P(X|Hi) P(Hi).
Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent:
  $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$
The probability of observing, say, two elements x1 and x2 together, given that the current class is C, is the product of the probabilities of each element taken separately, given the same class: P([x1,x2] | C) = P(x1 | C) * P(x2 | C)
No dependence relation between attributes
Greatly reduces the number of probabilities to maintain.
Sample quiz questions
1. What data does a naïve Bayesian net maintain?
2. Given X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair), buy or not buy?
Use the buys_computer training dataset (the same 14 records as on the Training Dataset slide).
Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class
  P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
  P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
  P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
  P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
  P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
  P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
  P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
  P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci):
  P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci):
  P(X|buys_computer=“yes”) * P(buys_computer=“yes”) = 0.028
  P(X|buys_computer=“no”) * P(buys_computer=“no”) = 0.007
X belongs to class “buys_computer=yes”
Pitfall: forgetting to multiply by P(Ci)
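The same numbers can be derived directly from the training records. This sketch uses raw relative frequencies without smoothing; the tuple layout of `ROWS` and the function name `score` are illustrative.

```python
ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def score(x, cls, rows=ROWS):
    """P(X | Ci) * P(Ci) under the conditional-independence assumption."""
    in_class = [r for r in rows if r[-1] == cls]
    p = len(in_class) / len(rows)                       # prior P(Ci)
    for k, value in enumerate(x):
        # P(x_k | Ci) estimated as a relative frequency within the class
        p *= sum(r[k] == value for r in in_class) / len(in_class)
    return p

x = ("<=30", "medium", "yes", "fair")
scores = {c: score(x, c) for c in ("yes", "no")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The per-class tables of P(x_k | Ci) plus the priors P(Ci) are all the classifier needs to keep.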
Assume five variables
T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny
T only directly influenced by L (i.e. T is conditionally
independent of R,M,S given L)
L only directly influenced by M and S (i.e. L is
conditionally independent of R given M & S)
R only directly influenced by M (i.e. R is conditionally
independent of L,S, given M)
M and S are independent
Making a Bayes net
T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny

Step One: add variables.
• Just choose the variables you’d like to be included in the net.

Step Two: add links.
• The link structure must be acyclic.
• If node X is given parents Q1,Q2,..Qn you are promising that any variable that’s a non-descendent of X is conditionally independent of X given {Q1,Q2,..Qn}
Links in this net: M → L, S → L, M → R, L → T

Step Three: add a probability table for each node.
• The table for node X must list P(X|Parent Values) for each possible combination of parent values
  P(S) = 0.3
  P(M) = 0.6
  P(L | M ^ S) = 0.05
  P(L | M ^ ~S) = 0.1
  P(L | ~M ^ S) = 0.1
  P(L | ~M ^ ~S) = 0.2
  P(R | M) = 0.3
  P(R | ~M) = 0.6
  P(T | L) = 0.3
  P(T | ~L) = 0.8
Computing with Bayes Net
(using the probability tables of the net M → L, S → L, M → R, L → T)
P(T ^ ~R ^ L ^ ~M ^ S)
= P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S).
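Plugging the table values into the last line of the derivation gives a concrete number. The dictionary layout and the helper `bern` are assumptions made for this sketch.

```python
from itertools import product

P_S, P_M = 0.3, 0.6
# P(L | M, S), keyed by (m, s)
P_L = {(True, True): 0.05, (True, False): 0.1, (False, True): 0.1, (False, False): 0.2}
P_R = {True: 0.3, False: 0.6}   # P(R | M), keyed by m
P_T = {True: 0.3, False: 0.8}   # P(T | L), keyed by l

def bern(p, value):
    """Probability that a binary variable with P(true) = p takes `value`."""
    return p if value else 1.0 - p

def joint(t, r, l, m, s):
    """P(T, R, L, M, S) as the product of each node's conditional table entry."""
    return (bern(P_T[l], t) * bern(P_R[m], r) * bern(P_L[(m, s)], l)
            * bern(P_M, m) * bern(P_S, s))

value = joint(t=True, r=False, l=True, m=False, s=True)
total = sum(joint(*v) for v in product([True, False], repeat=5))
print(value, total)
```

Here the product is 0.3 * 0.4 * 0.1 * 0.4 * 0.3, and summing the joint over all 32 assignments is a quick sanity check that the tables define a proper distribution.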
4. Data warehousing
     concept, schema
     data cube & operations (rollup, …)
     cube computation: multi-way array aggregation
     iceberg cube
     dynamic data cube
What is Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from
the organization’s operational database
Support information processing by providing a solid platform of
consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
Data warehousing:
The process of constructing and using data warehouses
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a
set of dimension tables
Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
A data cube
0-D (apex) cuboid: all
1-D cuboids: product; quarter; country
2-D cuboids: product, quarter; product, country; quarter, country
3-D (base) cuboid: product, quarter, country
Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
  Product:  Industry > Category > Product
  Location: Region > Country > City > Office
  Time:     Year > Quarter > Month / Week > Day
Pick one node from each dimension hierarchy, you get a data cube!
How many cubes? How many distinct cuboids?
Typical OLAP Operations
Roll up (drill-up): summarize data
  by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
  from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice:
  project and select
Pivot (rotate):
  reorient the cube, visualization, 3D to series of 2D planes.
Typical OLAP Operations
  Product:  Industry > Category > Product
  Location: Region > Country > City > Office
  Time:     Year > Quarter > Month / Week > Day
?? Starting from [product, city, week], what OLAP operations can produce the total sales for every month and every category in the “automobile” industry?
OLAP Server Architectures
Relational OLAP (ROLAP)
  Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces
  Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  Greater scalability
Multidimensional OLAP (MOLAP)
  Array-based multidimensional storage engine (sparse matrix techniques)
  Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
  User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers
  Specialized support for SQL queries over star/snowflake schemas
Multi-way Array Aggregation for Cube Computation
[Figure: a 3-D array over dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
Order: ABC
  AB: plane
  AC: line
  BC: point
Multi-Way Array Aggregation for
Cube Computation (Cont.)
Let A: 40 values, B: 400 values, C: 4000 values.
One chunk contains 10*100*1000 = 1,000,000 values.
ABC needs how much memory?
AB plane: 40*400=16,000
AC line: 40*(4000/4) = 40,000
BC point: (400/4)*(4000/4) = 100,000
total: 156,000
CBA needs how much memory?
CB plane: 4000*400=1,600,000
CA line: 4000*(40/4) = 40,000
BA point: (400/4)*(40/4) = 1000
total: 1,641,000 --- 10 times more!
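The memory arithmetic can be written as a small function of the scan order. It assumes each dimension is split into 4 chunks, as in the example; the function name `multiway_memory` is illustrative.

```python
def multiway_memory(sizes, chunks_per_dim=4):
    """Memory units needed when chunks are scanned in the dimension order `sizes`.

    `sizes` lists the dimension cardinalities from the fastest-varying
    dimension (x) to the slowest-varying one (z).
    """
    x, y, z = sizes
    cy, cz = y // chunks_per_dim, z // chunks_per_dim
    plane = x * y      # the whole xy plane must stay in memory
    line = x * cz      # one row of chunks of the xz plane
    point = cy * cz    # a single chunk of the yz plane
    return plane + line + point

abc = multiway_memory((40, 400, 4000))    # order ABC
cba = multiway_memory((4000, 400, 40))    # order CBA
print(abc, cba, cba / abc)
```

Scanning the smallest dimension first keeps the largest plane out of memory, which is where the roughly tenfold difference comes from.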
Computing iceberg cube using BUC
BUC (Beyer & Ramakrishnan, SIGMOD’99)
Bottom-up vs. top-down? It depends on how you view it!
Apriori property:
  Aggregate the data, then move to the next level
  If minsup is not met, stop!
The Dynamic Data Cube [EDBT’00]
[Figure: a two-dimensional array split into blocks covering rows and columns 1..4 and 5..8; every cell holds 1, and each block stores border sums such as 4, 8, 12, 16]
E.g. 16+12+8+6 = 42.
Query cost = update cost = $O(\log^2 n)$
Dynamic Data Cube summary
A balanced tree with fanout = 4.
The leaf nodes contain the original data cube.
Each index entry stores an X-border and a Y-border.
Each border is stored as a binary tree, which supports a 1-dim prefix-sum query and an update in O(log n) time.
Overall, the DDC supports a range-sum query and an update both in $O(\log^2 n)$ time.
5. Additional
     lattice (of itemsets, g-itemsets, rules, cuboids)
     distance-based indexing
Problem Statement
Given a set S of objects and a metric distance function d(). The similarity search problem is defined as: for an arbitrary object q and a threshold ε, find
  { o | o ∈ S ∧ d(o, q) < ε }
Solution without index: for every o ∈ S, compute d(q,o). Not efficient!
An Example of the VP-tree
S = {o1,…,o10}.
Randomly pick o1 as root.
Compute the distance between o1 and oi, sort in increasing order of distance:

oi          o3   o7   o6   o9   o10   o2    o8    o5    o4
d(o1, oi)   5    6    18   34   96    102   111   300   401

Build the tree recursively:
  root: o1
  left sub-tree (maxDL = 34): o3, o7, o6, o9
  right sub-tree (minDR = 96): o10, o2, o8, o5, o4
Query Processing
Given object q, compute d(q,root). Intuitively, if it’s small, search the left tree; otherwise, search the right tree.
In each index node, store:
  maxDL = max{ d(root, oi) | oi ∈ left tree },
  minDR = min{ d(root, oi) | oi ∈ right tree }.
Pruning condition:
  prune left if: d(q,root) – maxDL ≥ ε
  prune right if: minDR – d(q,root) ≥ ε
?? maxDL=10, minDR=20, d(q,root)=10, ε=10. Which sub-tree(s) do we check?
?? maxDL=10, minDR=20, d(q,root)=10, for what ε do we have to check both trees?
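The two pruning conditions translate directly into a small decision function, which can be used to experiment with the quiz values. The function name and the string labels are illustrative.

```python
def subtrees_to_search(d_q_root, max_dl, min_dr, eps):
    """Apply the two pruning conditions; return the sub-trees that must still be searched."""
    search = []
    if not d_q_root - max_dl >= eps:      # left sub-tree cannot be pruned
        search.append("left")
    if not min_dr - d_q_root >= eps:      # right sub-tree cannot be pruned
        search.append("right")
    return search

for eps in (5, 10, 11):
    print(eps, subtrees_to_search(10, max_dl=10, min_dr=20, eps=eps))
```

Both conditions follow from the triangle inequality: an object o in the left tree has d(q,o) ≥ d(q,root) – d(root,o) ≥ d(q,root) – maxDL, so it cannot satisfy d(o,q) < ε once that bound reaches ε, and symmetrically for the right tree.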
Summary
1. Frequent pattern & association
2. Clustering
3. Classification
4. Data warehousing
5. Additional