Techniques of
Classification and
Clustering
Advanced Database Systems, mod1-3, 2006
1
Problem Description
Assume
 A = {A1, A2, …, Ad}: a set of d (ordered or unordered) domains
 S = A1 × A2 × … × Ad: the d-dimensional (numerical or non-numerical) space
Input
 V = {v1, v2, …, vm}: d-dimensional points, where vi = <vi1, vi2, …, vid>
 The jth component of vi is drawn from domain Aj
Output
 G = {g1, g2, …, gk}: a set of groups (clusters) of V, each with a label, where gi ⊆ V
Advanced Database Systems, mod1-3, 2006
2
Classification
Supervised classification
 Discriminant analysis, or simply Classification
 A collection of labeled (pre-classified) training patterns is provided
 Aims to label newly encountered, as-yet-unlabeled (test) patterns
Unsupervised classification
 Clustering
 Aims to group a given collection of unlabeled patterns into meaningful clusters
 Category labels are data driven
Advanced Database Systems, mod1-3, 2006
3
Methods for Classification
Neural Nets
 Classification functions are obtained by making multiple passes over the training set
 Poor efficiency in generating the classifier (long training time)
 Does not handle non-numerical data efficiently
Decision trees
 If the training set E contains only objects of one group, the decision tree is just a leaf labeled with that group
 Construct a DT that correctly classifies the objects in the training data set
 Test the tree by classifying the unseen objects in the test data set
Advanced Database Systems, mod1-3, 2006
4
Decision Trees (Ex: Credit Analysis)
salary  education      label
10000   high school    reject
40000   undergraduate  accept
15000   undergraduate  reject
75000   graduate       accept
18000   graduate       accept

Decision tree induced from this data (a sketch of it as code follows this slide):
salary < 20000?
  no  -> accept
  yes -> education in graduate?
           yes -> accept
           no  -> reject
Advanced Database Systems, mod1-3, 2006
5
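The tree above can be read directly as nested rules. Below is a minimal, hypothetical Python sketch of that rule set (the attribute names and thresholds are taken from the example table; this is not code from the slides):

```python
def classify(salary: int, education: str) -> str:
    """Apply the credit-analysis decision tree from the example table."""
    if salary < 20000:
        # Low-salary applicants are accepted only with a graduate education.
        return "accept" if education == "graduate" else "reject"
    # Applicants earning 20000 or more are accepted regardless of education.
    return "accept"

# Reproduces the labels of the training records:
assert classify(10000, "high school") == "reject"
assert classify(75000, "graduate") == "accept"
```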
Decision Trees
 Pros
 Fast execution time
 Generated rules are easy to interpret by humans
 Scale well for large data sets
 Can handle high dimensional data
 Cons
 Cannot capture correlations among attributes
 Consider only axis-parallel cuts
Advanced Database Systems, mod1-3, 2006
6
Decision Tree Algorithms
 Classifiers from the machine learning community:
  ID3 [J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1, 1986]
  C4.5 [J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993]
  CART [L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984]
 Classifiers for large databases:
  SLIQ [MAR96], SPRINT [J. Shafer, R. Agrawal, M. Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proc. of VLDB Conf., Bombay, India, September 1996]
  SONAR [T. Fukuda, Y. Morimoto, S. Morishita, Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., Bombay, India, 1996]
  Rainforest [J. Gehrke, R. Ramakrishnan, V. Ganti, RainForest – A Framework for Fast Decision Tree Construction of Large Datasets, Proc. of VLDB Conf., 1998]
  Building phase followed by a pruning phase
Advanced Database Systems, mod1-3, 2006
7
Decision Tree Algorithms
 Building phase
  Recursively split nodes using the best splitting attribute for each node
 Pruning phase
  A smaller, imperfect decision tree generally achieves better accuracy on unseen data
  Prune leaf nodes recursively to prevent over-fitting
Advanced Database Systems, mod1-3, 2006
8
Preliminaries
Theoretic Background
 Entropy
Similarity measures
Advanced terms
Advanced Database Systems, mod1-3, 2006
9
Information Theory Concepts
 Entropy of a random variable X with probability distribution p(x):
   H(p) = -\sum_x p(x) \log p(x)
 The Kullback-Leibler (KL) divergence, or "relative entropy", between two probability distributions p and q:
   KL(p, q) = \sum_x p(x) \log ( p(x) / q(x) )
 Mutual information between random variables X and Y:
   I(X, Y) = \sum_x \sum_y p(x, y) \log ( p(x, y) / ( p(x) p(y) ) )
Advanced Database Systems, mod1-3, 2006
10
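As a concrete companion to these definitions, here is a small, self-contained Python sketch (not from the slides) that computes entropy, KL divergence, and mutual information for discrete distributions given as dictionaries; base-2 logarithms are assumed.

```python
import math
from collections import Counter

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x) for a dict of probabilities."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    """KL(p, q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def mutual_information(pxy):
    """I(X, Y) from a joint distribution given as {(x, y): p(x, y)}."""
    px, py = Counter(), Counter()
    for (x, y), p in pxy.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# Example: a fair coin has one bit of entropy.
print(entropy({"heads": 0.5, "tails": 0.5}))  # 1.0
```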
What is Entropy
 S is a sample of the training data set
 Entropy measures the impurity of S
   H(X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_m \log_2 p_m
        = -\sum_{j=1}^{m} p_j \log_2 p_j
   E(S) = -\sum_k (e_k / e) \log_2 (e_k / e)
H(X) = the entropy of X
- If H(X) = 0, X takes a single value; as H(X) increases, the values of X become more heterogeneous.
- For the same number of possible values of X:
  - "Low entropy" means the distribution of X has marked peaks and valleys: a histogram of the frequency distribution of values of X would have many lows and one or two highs, so values sampled from it would be relatively predictable.
  - "High entropy" means X is close to a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat, so values sampled from it would be all over the place.
Advanced Database Systems, mod1-3, 2006
11
T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama, Constructing Efficient Decision
Trees by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., 1996.
Entropy-Based Data Segmentation
The attribute has three categories over 100 objects: 40 in C1, 30 in C2, 30 in C3.

E(S) = -\sum_k (e_k / e) \log (e_k / e)
     = -(40/100)\log(40/100) - (30/100)\log(30/100) - (30/100)\log(30/100)
     ≈ 1.09   (natural logarithm)

Candidate segmentations (class counts per segment):
  S1 (60 objects): C1 = 40, C2 = 10, C3 = 10     S2 (40 objects): C1 = 0,  C2 = 20, C3 = 20
  S3 (60 objects): C1 = 20, C2 = 20, C3 = 20     S4 (40 objects): C1 = 20, C2 = 10, C3 = 10

Splitting entropy:
  E(S_i; S_j) = (n_i / n) E(S_i) + (n_j / n) E(S_j)
  E(S1; S2) = 0.80
  E(S3; S4) = 1.075
(A small computational sketch follows this slide.)
Advanced Database Systems, mod1-3, 2006
12
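The values above (1.09, 0.80, 1.075) can be reproduced with a few lines of Python; this is an illustrative sketch, not code from the lecture, and it uses natural logarithms as the slide's numbers do.

```python
import math

def entropy(counts):
    """E(S) = -sum_k (e_k / e) ln(e_k / e) over the class counts of segment S."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def splitting_entropy(*segments):
    """Weighted entropy E(S_i; S_j; ...) of a segmentation, with weights n_i / n."""
    n = sum(sum(seg) for seg in segments)
    return sum(sum(seg) / n * entropy(seg) for seg in segments)

print(round(entropy([40, 30, 30]), 2))                          # 1.09
print(round(splitting_entropy([40, 10, 10], [0, 20, 20]), 2))   # ~0.80
print(round(splitting_entropy([20, 20, 20], [20, 10, 10]), 3))  # 1.075
```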
R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami, An Interval Classifier for Database Mining Applications, Proc. of VLDB Conf., 1992.
Information Theoretic Measure
Information gain by branching on attribute Ai:
  gain(Ai) = E - Ei
 The entropy E of an object set containing e_k objects of group G_k (e objects in total):
   E = -\sum_k (e_k / e) \log_2 (e_k / e)
 The expected entropy for the tree with Ai as the root:
   E_i = \sum_j (e_ij / e) E_ij
 where E_ij is the expected entropy of the subtree for the jth value of Ai (the object subset with that value, of size e_ij).
 Information content of the value of Ai:
   I(Ai) = -\sum_j (e_ij / e) \log_2 (e_ij / e)
Advanced Database Systems, mod1-3, 2006
13
Ex
Object set: 100 objects, with 40 in C1, 30 in C2, 30 in C3.
E(S) = -\sum_k (e_k / e) \log (e_k / e)
     = -(40/100)\log(40/100) - (30/100)\log(30/100) - (30/100)\log(30/100)
     ≈ 1.09

First candidate split:
  S1 (60 objects): C1 = 20, C2 = 20, C3 = 20     S2 (40 objects): C1 = 20, C2 = 10, C3 = 10
Second candidate split:
  S3 (40 objects): all C1     S4 (30 objects): all C2     S5 (30 objects): all C3

Gain: gain(Ai) = E - Ei
  E1(S1; S2) = (60/100) E1(S1) + (40/100) E1(S2) = 1.075
  E2(S3; S4; S5) = 0
  gain1 = E - E1 = 0.015
  gain2 = E - E2 = 1.09
Advanced Database Systems, mod1-3, 2006
14
Distributional Similarity Measures
 Cosine
 Jaccard coefficient
 Dice coefficient
 Overlap coefficient
 L1 distance (City block distance)
 Euclidean distance (L2 distance)
 Hellinger distance
 Information Radius (Jensen-Shannon divergence)
 Skew divergence
 Confusion Probability
 Lin’s Similarity Measure
Advanced Database Systems, mod1-3, 2006
15
Similarity Measures
 Minkowski distance
   d_p(x_i, x_j) = ( \sum_{k=1}^{d} | x_{i,k} - x_{j,k} |^p )^{1/p} = || x_i - x_j ||_p
 Euclidean distance (p = 2):
   d_2(x_i, x_j) = ( (x_{i,1} - x_{j,1})^2 + (x_{i,2} - x_{j,2})^2 + ... + (x_{i,d} - x_{j,d})^2 )^{1/2}
 Manhattan distance (p = 1):
   d_1(x_i, x_j) = | x_{i,1} - x_{j,1} | + | x_{i,2} - x_{j,2} | + ... + | x_{i,d} - x_{j,d} |
 Mahalanobis distance
   d_M(x_i, x_j) = (x_i - x_j) \Sigma^{-1} (x_i - x_j)^T
  Provides normalization via a weighting scheme; \Sigma is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process.
Advanced Database Systems, mod1-3, 2006
16
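A compact numpy sketch of these four distances (illustrative only; the `cov` argument below stands for the covariance matrix Σ, which the caller must supply):

```python
import numpy as np

def minkowski(x, y, p):
    """d_p(x, y) = (sum_k |x_k - y_k|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)   # p = 2

def manhattan(x, y):
    return minkowski(x, y, 1)   # p = 1

def mahalanobis(x, y, cov):
    """Quadratic form (x - y) cov^{-1} (x - y)^T, as written on the slide."""
    diff = x - y
    return float(diff @ np.linalg.inv(cov) @ diff)

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(x, y), manhattan(x, y))  # 5.0 7.0
```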
General form
  IT-Sim(A, B) = I(common(A, B)) / I(description(A, B))
 I(common(A, B)): information content associated with the statement describing what A and B have in common
 I(description(A, B)): information content associated with the statement describing A and B
 π(s): probability of the statement s within the world of the objects in question, i.e., the fraction of objects exhibiting feature s.
  IT-Sim(A, B) = 2 \sum_{s \in A \cap B} \log \pi(s) / ( \sum_{s \in A} \log \pi(s) + \sum_{s \in B} \log \pi(s) )
Advanced Database Systems, mod1-3, 2006
17
Similarity Measures
The Set/Bag Model: let X and Y be two collections of XML documents
 Jaccard's Coefficient
   sim_Jacc(X, Y) = |X ∩ Y| / |X ∪ Y|
 Dice's Coefficient
   sim_Dice(X, Y) = 2 |X ∩ Y| / ( |X| + |Y| )
Advanced Database Systems, mod1-3, 2006
18
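A minimal Python sketch of the two coefficients on plain sets (illustrative; the slide's X and Y are document collections):

```python
def jaccard(x: set, y: set) -> float:
    """|X ∩ Y| / |X ∪ Y|; defined as 0 for two empty sets here."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def dice(x: set, y: set) -> float:
    """2 |X ∩ Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0

print(jaccard({1, 2, 3}, {3, 4, 5}), dice({1, 2, 3}, {3, 4, 5}))  # 0.2 0.333...
```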
Similarity Measures
[Figure: two document vectors d_1 and d_2 in a term space with axes t1, t2, t3, separated by angle θ.]
Cosine-Similarity Measure (CSM):
  Sim(d_j, d_k) = (d_j · d_k) / ( |d_j| |d_k| )
               = \sum_{i=1}^{n} w_{ij} w_{ik} / ( \sqrt{ \sum_{i=1}^{n} w_{ij}^2 } \sqrt{ \sum_{i=1}^{n} w_{ik}^2 } )
The Vector-Space Model: Cosine-Similarity Measure (CSM)
  Sim_Cos(X, Y) = X · Y / ( |X| |Y| )
Advanced Database Systems, mod1-3, 2006
19
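An illustrative cosine-similarity function over term-weight vectors (not from the slides):

```python
import math

def cosine_similarity(w_j, w_k):
    """Sim(d_j, d_k) = sum_i w_ij * w_ik / (||w_j|| * ||w_k||) for equal-length weight lists."""
    dot = sum(a * b for a, b in zip(w_j, w_k))
    norm = math.sqrt(sum(a * a for a in w_j)) * math.sqrt(sum(b * b for b in w_k))
    return dot / norm if norm else 0.0

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0 (same direction)
```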
Query Processing: a single cosine
For every term i, with each doc j, store term
frequency tfij.
 Some tradeoffs on whether to store term count,
term weight, or weighted by idfi.
At query time, accumulate the component-wise sum
  sim(d_j, d_k) = \sum_{i=1}^{m} w_{i,j} \cdot w_{i,k}
If you’re indexing 5 billion documents (web
search) an array of accumulators is infeasible
Ideas?
Advanced Database Systems, mod1-3, 2006
20
Similarity Measures (2)
The Generalized Cosine-Similarity Measure (GCSM): let X and Y be vectors X = \sum_i x_i l_i and Y = \sum_i y_i l_i, where the l_i are label vectors from a hierarchical model.
  sim_GCSM(X, Y) = <X, Y> / \sqrt{ <X, X> <Y, Y> }
  <X, Y> = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i y_j <l_i, l_j>
  <l_i, l_j> = 2 · depth( LCA(l_i, l_j) ) / ( depth(l_i) + depth(l_j) )
 Why only depth? (The inner product of two labels depends only on their depths and the depth of their lowest common ancestor.)
Advanced Database Systems, mod1-3, 2006
21
f d , t   log
wd , t  
2 Dim Similarities
N
n(t )

N 


  f d , t  log n(t ) 
t
wd , t  
2
f d , t   log

t
N
n(t )
N
f d , t   log
n(t )
Cosine Measure simd1 , d 2    wd1 , t   wd 2 , t 
Hellinger Measure sim d1 , d 2    wd1 , t   wd 2 , t 
t
t
Tanimoto Measure
sim d1 , d 2  
 wd , t  wd , t 
1
 wd , t   wd , t   wd , t  wd , t 
2
1
Clarity Measure
2
t
2
2
1
2
t
  KLwt , d1  || wt , d 2   KLwt , d1  || GE 
sim d1 , d 2   KLwt , d 2  || wt , d1   KLwt , d 2  || GE 

Advanced Database Systems, mod1-3, 2006
22
Advanced Terms
Conditional Entropy
Information Gain
Advanced Database Systems, mod1-3, 2006
23
Specific Conditional Entropy
 H(Y|X=v)
 Suppose I’m trying to predict output Y and I have
input X
 X=College Major, Y= likes “Gladiator”
 Let’s assume this reflects the true probabilities
  X (Major)   Y (likes "Gladiator")
  Math        Yes
  History     No
  CS          Yes
  Math        No
  Math        No
  CS          Yes
  History     No
  Math        Yes

From this data we estimate:
- P(LikeG = Yes) = 0.5
- P(Major = Math & LikeG = No) = 0.25
- P(Major = Math) = 0.5
- P(LikeG = Yes | Major = History) = 0
Note:
- H(X) = 1.5; H(Y) = 1
- H(Y | X = Math) = 1; H(Y | X = History) = 0; H(Y | X = CS) = 0
Advanced Database Systems, mod1-3, 2006
24
Conditional Entropy
 Definition of conditional entropy:
  H(Y|X) = the average specific conditional entropy of Y
  = the conditional entropy of Y you expect if you choose a record at random and condition on that record's value of X
  = the expected number of bits needed to transmit Y if both sides already know the value of X
  H(Y|X) = \sum_{j=1}^{m} Prob(X = v_j) H(Y | X = v_j)

Using the Major / likes-"Gladiator" data from the previous slide:
  v_j       Prob(X = v_j)   H(Y | X = v_j)
  Math      0.5             1
  History   0.25            0
  CS        0.25            0

H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
Advanced Database Systems, mod1-3, 2006
25
Information Gain
 Definition of information gain:
  IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
  IG(Y|X) = H(Y) - H(Y|X)

Using the same Major / likes-"Gladiator" data:
  H(Y) = 1
  H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
  Thus, IG(Y|X) = 1 - 0.5 = 0.5
(A small computational sketch follows this slide.)
Advanced Database Systems, mod1-3, 2006
26
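The slide values (H(Y) = 1, H(Y|X) = 0.5, IG = 0.5) can be checked with a short Python sketch over the eight (Major, LikeG) records; illustrative only.

```python
import math
from collections import Counter

data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    """H of the empirical distribution over the given values (base 2)."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    """H(Y|X) = sum_v P(X = v) * H(Y | X = v)."""
    n = len(pairs)
    by_x = {}
    for x, y in pairs:
        by_x.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in by_x.values())

h_y = entropy([y for _, y in data])
h_y_given_x = conditional_entropy(data)
print(h_y, h_y_given_x, h_y - h_y_given_x)  # 1.0 0.5 0.5  (the last value is IG(Y|X))
```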
Relative Information Gain
 Definition of Relative Information Gain
 RIG(Y|X) = I must transmit Y. What fraction of the bits on average would it save me if both ends of the line knew X?
 RIG(Y|X) = [H(Y) - H(Y|X)] / H(Y)

Using the same Major / likes-"Gladiator" data:
  H(Y) = 1
  H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
  Thus, RIG(Y|X) = (1 - 0.5) / 1 = 0.5
Advanced Database Systems, mod1-3, 2006
27
What is Information Gain used for?
Suppose you are trying to predict whether
someone is going to live past 80 years. From
historical data you might find…
 IG(LongLife | HairColor) = 0.01
 IG(LongLife | Smoker) = 0.2
 IG(LongLife | Gender) = 0.25
 IG(LongLife | LastDigitOfSSN) = 0.00001
IG tells you how interesting a 2-d contingency table is going to be.
Advanced Database Systems, mod1-3, 2006
28
Clustering
 Given:
 Data points and number of desired clusters K
 Group the data points into K clusters
 Data points within clusters are more similar than
across clusters
 Sample applications:
 Customer segmentation
 Market basket customer analysis
 Attached mailing in direct marketing
 Clustering companies with similar growth
Advanced Database Systems, mod1-3, 2006
29
A Clustering Example
[Figure: four clusters of customers.
 Cluster 1: Income: High,   Children: 1, Car: Luxury
 Cluster 2: Income: Low,    Children: 0, Car: Compact
 Cluster 3: Income: Medium, Children: 2, Car: Sedan
 Cluster 4: Income: Medium, Children: 3, Car: Truck]
Advanced Database Systems, mod1-3, 2006
30
Different ways of representing
clusters
[Figure: four ways of representing the same clustering of points a-k:
 (a) partitioning the points with cluster boundaries in feature space;
 (b) attaching a cluster label to each point;
 (c) a fuzzy membership matrix, e.g. a point belonging to clusters 1, 2, 3 with degrees such as 0.4, 0.1, 0.5;
 (d) a dendrogram over the points (a c i e d k b j f h).]
Advanced Database Systems, mod1-3, 2006
31
Clustering Methods
 Partitioning
  Given a set of objects and a clustering criterion, partitional clustering obtains a partition of the objects into clusters such that the objects in a cluster are more similar to each other than to objects in different clusters.
  K-means and K-medoid methods determine K cluster representatives and assign each object to the cluster whose representative is closest to the object, such that the sum of the squared distances between the objects and their representatives is minimized.
 Hierarchical
  Nested sequence of partitions.
  Agglomerative: starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters until all objects are in a single cluster.
  Divisive: starts with all objects in one cluster and subdivides it into smaller and smaller pieces.
Advanced Database Systems, mod1-3, 2006
32
Algorithms
 k-Means
Fuzzy C-Means Clustering
Hierarchical Clustering
Probabilistic Clustering
Advanced Database Systems, mod1-3, 2006
33
Similarity Measures (2)
Mutual Neighbor Distance (MND)
 MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i), where NN(x_i, x_j) is the neighbor number of x_j with respect to x_i (i.e., x_j's rank among x_i's nearest neighbors).
Distance under context
 s(x_i, x_j) = f(x_i, x_j, e), where e is the context (the surrounding points).
[Figure: two scatter plots in the (x1, x2) plane, one with points A, B, C and one with additional points D, E, F, showing how added context points change which points appear mutually close.]
Advanced Database Systems, mod1-3, 2006
34
K-Means Clustering Algorithm
1. Choose k cluster centers to coincide with k
randomly-chosen patterns
2. Assign each pattern to its closest cluster
center.
3. Recompute the cluster centers using the
current cluster memberships.
4. If a convergence criterion is not met, go to
step 2.
Typical convergence criteria:
No (or minimal) reassignment of patterns to new cluster
centers, or minimal decrease in squared error.
Advanced Database Systems, mod1-3, 2006
35
Objective Function
 k-Means algorithm aims at minimizing the
following objective function: (square error
function)
  J = \sum_{j=1}^{k} \sum_{i=1}^{n} || x_i^{(j)} - c_j ||^2
where x_i^{(j)} is the ith pattern assigned to the jth cluster and c_j is the jth cluster center.
(A short sketch of the algorithm in code follows this slide.)
Advanced Database Systems, mod1-3, 2006
36
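A minimal numpy sketch of steps 1-4 of the k-means algorithm above (illustrative; random initialization by choosing k of the patterns, convergence when assignments stop changing):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Return (centers, labels) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    labels = np.zeros(len(X), dtype=int)
    for iteration in range(max_iter):
        # Step 2: assign each pattern to its closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when no reassignment occurs.
        if iteration > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute centers from the current memberships (keep old center if a cluster empties).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
centers, labels = k_means(X, 2)
print(labels)  # two well-separated groups, e.g. [0 0 1 1]
```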
K-Means Algorithm (Ex)
[Figure: an example run of k-means on ten points A-J in the plane.]
Advanced Database Systems, mod1-3, 2006
37
Distortion
Given a clustering C, we denote by C(x) the centroid this clustering associates with an arbitrary point x. A measure of quality for C:
  Distortion = \sum_x d^2(x, C(x)) / R
 where R is the total number of points and x ranges over all input points.
Improvement (penalized score):
  Distortion + λ (# parameters) log R = Distortion + λ m k log R
 for k centroids in m dimensions, with λ a constant weight.
Advanced Database Systems, mod1-3, 2006
38
Remarks
How to initialize the means is a problem. One popular way to start is to randomly choose k of the samples.
The results produced depend on the initial values for the means.
It can happen that the set of samples closest to m_i is empty, so that m_i cannot be updated.
The results depend on the metric used to measure || x - m_i ||.
Advanced Database Systems, mod1-3, 2006
39
Related Work [Clustering]
 Graph-based clustering []
 For an XML document collection C, the s-Graph sg(C) = (N, E) is a directed graph such that N is the set of all the elements and attributes in the documents in C, and (a, b) ∈ E if and only if a is a parent element of b in some document in C (b can be an element or an attribute).
 For two sets C1 and C2 of XML documents, the distance between them, where |sg(Ci)| is the number of edges of sg(Ci):
   dist(C1, C2) = 1 - |sg(C1) ∩ sg(C2)| / max{ |sg(C1)|, |sg(C2)| }
Advanced Database Systems, mod1-3, 2006
40
Fuzzy C-Means Clustering
FCM is a method of clustering which allows one piece of data to belong to two or more clusters. It minimizes the objective function
  J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m || x_i - c_j ||^2
Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the membership u and the cluster centers c updated by:
  u_{ij} = 1 / \sum_{k=1}^{C} ( || x_i - c_j || / || x_i - c_k || )^{2/(m-1)}
  c_j = \sum_{i=1}^{N} u_{ij}^m x_i / \sum_{i=1}^{N} u_{ij}^m
Advanced Database Systems, mod1-3, 2006
41
[Example membership matrix u_ij: rows i are data items 1, 2, 3, ..., columns j are clusters 1, 2, 3, 4, 5, ...; entries such as 1, 0, .2, .3, .5 give each item's degree of membership in each cluster.]
The iteration stops when max_ij | u_ij^(k+1) - u_ij^(k) | < ε, where ε is a termination criterion between 0 and 1 and k indexes the iteration steps. This procedure converges to a local minimum or a saddle point of J_m.
(A small sketch of one FCM update step follows this slide.)
Advanced Database Systems, mod1-3, 2006
42
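A minimal numpy sketch of one round of the FCM updates above (illustrative; m is the fuzzifier, and a tiny eps guards against division by zero when a point coincides with a center):

```python
import numpy as np

def fcm_update(X, centers, m=2.0, eps=1e-12):
    """One FCM iteration: recompute memberships u (N x C) and centers (C x d)."""
    # Distances ||x_i - c_j|| for all points and centers.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    # u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
    ratio = dist[:, :, None] / dist[:, None, :]
    u = 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=2)
    # c_j = sum_i u_ij^m x_i / sum_i u_ij^m
    um = u ** m
    new_centers = (um.T @ X) / um.sum(axis=0)[:, None]
    return u, new_centers

X = np.array([[0.0], [1.0], [9.0], [10.0]])
centers = np.array([[0.5], [9.5]])
u, centers = fcm_update(X, centers)
print(u.round(2))  # each row sums to 1; points near a center get membership close to 1
```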
Fuzzy Clustering
[Figure: membership values (0 to 1) plotted against the data for two clusters C1 and C2; in hard clustering each point has membership 0 or 1, while in fuzzy clustering memberships vary smoothly between 0 and 1.]
Properties
 u_ij ∈ [0, 1] for all i, j
 \sum_{j=1}^{C} u_ij = 1 for all i
 0 < \sum_{i=1}^{N} u_ij < N for all j
Advanced Database Systems, mod1-3, 2006
43
Speculations
Correlation between m and ε:
 More iterations k are needed for smaller ε.
Advanced Database Systems, mod1-3, 2006
44
Hierarchical Clustering
 Basic Process
1. Start by assigning each item to its own cluster, so that for N items you have N clusters. (Let the distances between the clusters be the same as the distances between the items they contain.)
2. Find the closest (most similar) pair of clusters and merge them into a single cluster.
3. Compute distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Advanced Database Systems, mod1-3, 2006
45
Hierarchical Clustering (Ex)
[Figure: seven points A-G in feature space and the corresponding dendrogram over A B C D E F G, with similarity on the vertical axis.]
Advanced Database Systems, mod1-3, 2006
46
Hierarchical Clustering Algorithms
Single-linkage clustering
 The distance between two clusters is the minimum
of the distances between all pairs of patterns
drawn from the two clusters (one pattern from the
first cluster, the other from the second).
Complete-linkage clustering
 The distance between two clusters is the
maximum of the distances between all pairs of
patterns drawn from the two clusters
Average-linkage clustering
Minimum-variance algorithm
Advanced Database Systems, mod1-3, 2006
47
Single-/Complete-Link Clustering
[Figure: two panels showing the same two point sets (labeled 1 and 2) connected by a chain of noise points (*); the left panel illustrates single-link clustering and the right panel complete-link clustering of the same data.]
Advanced Database Systems, mod1-3, 2006
48
Single Linkage Hierarchical Cluster
 Steps:
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set L(m) = d[(r),(s)].
4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min( d[(k),(r)], d[(k),(s)] ).
5. If all objects are in one cluster, stop. Else go to step 2.
(A small sketch of this procedure in code follows this slide.)
Advanced Database Systems, mod1-3, 2006
49
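A compact, illustrative Python sketch of single-linkage agglomerative clustering following the steps above (a naive O(n^3) version on a precomputed distance matrix; not the lecture's code):

```python
def single_linkage(dist):
    """dist: symmetric n x n list of lists. Returns the merge history as (level, cluster_a, cluster_b)."""
    clusters = [{i} for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the least dissimilar pair of clusters (single link = min over point pairs).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        level, a, b = best
        # Steps 3-4: merge (r) and (s); the min-based proximity update is implicit in recomputing d above.
        merges.append((level, clusters[a], clusters[b]))
        clusters[a] = clusters[a] | clusters[b]
        del clusters[b]
    return merges

dist = [[0, 1, 5, 6], [1, 0, 4, 7], [5, 4, 0, 2], [6, 7, 2, 0]]
print(single_linkage(dist))  # merges {0,1} at level 1, {2,3} at level 2, then both groups at level 4
```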
Ex: Single-Linkage
[Figure: a worked single-linkage example on a distance matrix between cities/states.]
Advanced Database Systems, mod1-3, 2006
50
Agglomerative Hierarchical d b , b   maxb , b 
Clustering
bi XOR b j
i
j
i
j
ALGORITHM Agglomerative Hierarchical Clustering
INPUT: bit-vectors B in bitmap index BI
OUTPUT: a tree T
METHOD:
(1) Place each bit-vector Bi in its cluster (singleton), creating the list of
clusters L (initially, the leaves of T): L=B1, B2, …, Bn.


(2) Compute a merging cost function, d Bi , B j 
min
bi Bi ,b j B j
d bi , b j 
between every pair of elements in L to find the two closest clusters
{Bi,Bj} which will be the cheapest couple to merge.
(3) Remove Bi and Bj from L.
(4) Merge Bi and Bj to create a new internal node Bjj in T which will be
the parent of Bi and Bj in the result tree.
(5) Repeat from (2) until there is only one set remaining.
Advanced Database Systems, mod1-3, 2006
51
Graph-Theoretic Clustering
Construct the minimal spanning tree (MST)
Delete the MST edges with the largest lengths
[Figure: points A-G in the (x1, x2) plane connected by an MST with edge lengths such as 0.5, 1.5, 1.5, 1.7, 3.5, and 6.5; deleting the longest edge(s) splits the data into clusters.]
Advanced Database Systems, mod1-3, 2006
52
Improving k-Means
D. Pelleg and A. Moore, Accelerating Exact k-means Algorithms with Geometric Reasoning,
ACM Proceedings of Conf. on Knowledge and Data Mining, 1999.
Definitions
 Center of clusters → (Th. 2) center of a rectangle h; owner(h)
 c1 dominates c2 w.r.t. h if h lies entirely on c1's side of the boundary between c1 and c2 (pp. 7, 9)
Update Centroid
 If, for all other centers c', c dominates c' w.r.t. h (so c = owner(h), p. 10), assign h to owner(h); otherwise split h
 (Blacklist version) c1 dominates c2 w.r.t. every h' contained in h (p. 11)
Advanced Database Systems, mod1-3, 2006
53
Clustering Categorical Data: ROCK
S. Guha, R. Rastogi, K. Shim, ROCK: Robust Clustering using linKs, IEEE
Conf Data Engineering, 1999
 Uses links to measure similarity/proximity
 Not distance based
 Computational complexity: O(n^2 + n m_m m_a + n^2 log n), where m_a and m_m are the average and maximum numbers of neighbors
Basic ideas:
 Similarity function and neighbors:
   Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
   For T1 = {1,2,3}, T2 = {3,4,5}:  Sim(T1, T2) = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
Advanced Database Systems, mod1-3, 2006
54
Using Jaccard Coefficient
CLUSTER 1 (transactions drawn from <1,2,3,4,5>):
  {1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5}
CLUSTER 2 (transactions drawn from <1,2,6,7>):
  {1,2,6} {1,2,7} {1,6,7} {2,6,7}
According to the Jaccard coefficient, the distance between {1,2,3} and {1,2,6} is the same as the one between {1,2,3} and {1,2,4}, although the former pair comes from two different clusters.
Advanced Database Systems, mod1-3, 2006
55
ROCK
Inducing LINKs: the main problem with such pairwise measures is that only local properties involving the two points themselves are considered:
 Neighbor: if two points are similar enough to each other, they are neighbors.
 Link: the link count for a pair of points is the number of common neighbors they have.
Advanced Database Systems, mod1-3, 2006
56
S. Guha, R. Rastogi, K. Shim, ROCK: Robust Clustering using linKs, IEEE Conf Data Engineering, 1999
Rock: Algorithm
Links: the number of common neighbors of the two points.
Example points (similarity threshold 0.5):
  {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}
  {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
  link({1,2,3}, {1,2,4}) = 3
Algorithm
 Draw a random sample
 Cluster with links
 Label the data on disk
Advanced Database Systems, mod1-3, 2006
57
S. Guha, R. Rastogi, K. Shim, ROCK: Robust Clustering using linKs, IEEE Conf Data Engineering, 1999
Rock: Algorithm
Criterion function: maximize the total number of links within the k clusters,
  E_l = \sum_{i=1}^{k} \sum_{P_q, P_r \in C_i} link(P_q, P_r)
 (in the full ROCK criterion each inner sum is normalized by the expected number of links, n_i^{1+2f(θ)})
 C_i denotes cluster i, of size n_i.

Example points:
  {1,2,3} {1,2,4} {1,2,5} {1,3,4} {1,3,5} {1,4,5} {2,3,4} {2,3,5} {2,4,5} {3,4,5}
  {1,2,6} {1,2,7} {1,6,7} {2,6,7}
For the similarity threshold 0.5:
  link({1,2,6}, {1,2,7}) = 4
  link({1,2,6}, {1,2,3}) = 3
  link({1,6,7}, {1,2,3}) = 2
  link({1,2,3}, {1,4,5}) = 3
(A small link-counting sketch follows this slide.)
Advanced Database Systems, mod1-3, 2006
58
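An illustrative sketch of the neighbor and link computations used above (Jaccard similarity, threshold θ = 0.5; not code from the ROCK paper, and depending on whether the two points themselves are counted as neighbors, slightly different link counts are obtained):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def neighbors(p, points, theta=0.5):
    """All other points whose Jaccard similarity with p is at least theta."""
    return {q for q in points if q != p and jaccard(p, q) >= theta}

def link(p, q, points, theta=0.5):
    """ROCK link = number of common neighbors of p and q."""
    return len(neighbors(p, points, theta) & neighbors(q, points, theta))

pts = [frozenset(s) for s in
       [{1,2,3},{1,2,4},{1,2,5},{1,3,4},{1,3,5},{1,4,5},{2,3,4},{2,3,5},{2,4,5},{3,4,5},
        {1,2,6},{1,2,7},{1,6,7},{2,6,7}]]
print(link(frozenset({1,2,6}), frozenset({1,2,3}), pts),   # 3, as on the slide
      link(frozenset({1,6,7}), frozenset({1,2,3}), pts))   # 2, as on the slide
```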
More on Hierarchical Clustering Methods
Major weakness of agglomerative clustering
methods
 do not scale well: time complexity of at least O(n2),
where n is the number of total objects
 can never undo what was done previously
Integration of hierarchical with distance-based
clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
 CURE (1998): selects well-scattered points from the
cluster and then shrinks them towards the center of
the cluster by a specified fraction
Advanced Database Systems, mod1-3, 2006
59
Zhang, Ramakrishnan, Livny, Birch: Balanced Iterative Reducing and Clustering using Hierarchies, ACM SIGMOD 1996.
BIRCH
Pre-cluster data points using a CF-tree
 For each point:
  • The CF-tree is traversed to find the closest cluster
  • If the threshold criterion is satisfied, the point is absorbed into that cluster
  • Otherwise, it forms a new cluster
Requires only a single scan of the data
The cluster summaries stored in the CF-tree are then given to a main-memory hierarchical clustering algorithm
Advanced Database Systems, mod1-3, 2006
60
Initialization of BIRCH
The CF of a cluster of n d-dimensional vectors V1, …, Vn is defined as (n, LS, SS)
 n is the number of vectors
 LS is the sum of the vectors
 SS is the sum of squares of the vectors
CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
 This additivity property is used for incrementally maintaining cluster features (see the sketch after this slide)
The distance between two clusters CF1 and CF2 is defined to be the distance between their centroids.
Advanced Database Systems, mod1-3, 2006
61
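A minimal sketch of the CF triple and its additivity (illustrative; 1-D points for brevity, matching the worked CF-tree insertion example a few slides later):

```python
from dataclasses import dataclass

@dataclass
class CF:
    n: int       # number of points
    ls: float    # linear sum of the points
    ss: float    # sum of squared points

    def __add__(self, other):
        # Additivity: CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

def cf_of(points):
    return CF(len(points), sum(points), sum(p * p for p in points))

# From the CF-tree insertion slide: points 3 and 6 give CF (2, 9, 45).
print(cf_of([3, 6]))               # CF(n=2, ls=9, ss=45)
print(cf_of([3, 6]) + cf_of([8]))  # CF(n=3, ls=17, ss=109)
```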
Zhang, Ramakrishnan, Livny, Birch: Balanced Iterative Reducing and Clustering using Hierarchies, ACM SIGMOD 1996.
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
  N: number of data points
  LS: linear sum of the N data points, \sum_{i=1}^{N} X_i
  SS: square sum of the N data points, \sum_{i=1}^{N} X_i^2
Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8)
  CF = (5, (16, 30), (54, 190))
[Figure: the five points plotted on a 10 x 10 grid.]
Advanced Database Systems, mod1-3, 2006
62
Zhang, Ramakrishnan, Livny, Birch: Balanced Iterative Reducing and Clustering using Hierarchies, ACM SIGMOD 1996.
Notations
 Given N d-dimensional data points in a cluster: {X_i}
 Centroid X0, radius R, diameter D, centroid Euclidean distance D0, centroid Manhattan distance D1:
   X0 = ( \sum_{i=1}^{N} X_i ) / N
   R  = ( \sum_{i=1}^{N} (X_i - X0)^2 / N )^{1/2}
   D  = ( \sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2 / ( N (N - 1) ) )^{1/2}
   D0 = ( (X0_1 - X0_2)^2 )^{1/2}
   D1 = | X0_1 - X0_2 | = \sum_{i=1}^{d} | X0_1^{(i)} - X0_2^{(i)} |
Advanced Database Systems, mod1-3, 2006
63
Zhang, Ramakrishnan, Livny, Birch: Balanced Iterative Reducing and Clustering using Hierarchies, ACM SIGMOD 1996.
Notations (2)
 Given N1 + N2 d-dimensional data points {X_i}, with i = 1..N1 in one cluster and i = N1+1..N1+N2 in another
 Average inter-cluster distance D2, average intra-cluster distance D3, variance increase distance D4:
   D2 = ( \sum_{i=1}^{N1} \sum_{j=N1+1}^{N1+N2} (X_i - X_j)^2 / ( N1 N2 ) )^{1/2}
   D3 = ( \sum_{i=1}^{N1+N2} \sum_{j=1}^{N1+N2} (X_i - X_j)^2 / ( (N1 + N2)(N1 + N2 - 1) ) )^{1/2}
   D4 = \sum_{k=1}^{N1+N2} ( X_k - \frac{ \sum_{l=1}^{N1+N2} X_l }{ N1 + N2 } )^2
        - \sum_{i=1}^{N1} ( X_i - \frac{ \sum_{l=1}^{N1} X_l }{ N1 } )^2
        - \sum_{j=N1+1}^{N1+N2} ( X_j - \frac{ \sum_{l=N1+1}^{N1+N2} X_l }{ N2 } )^2
Advanced Database Systems, mod1-3, 2006
64
Zhang, Ramakrishnan, Livny, Birch: Balanced Iterative Reducing and Clustering using Hierarchies, ACM SIGMOD 1996.
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1 … CF6, each with a child pointer; non-leaf nodes hold entries CF1 … CF5 with child pointers; leaf nodes hold up to L CF entries and are chained together by prev/next pointers.]
Advanced Database Systems, mod1-3, 2006
65
Example
Given (T = 2?), B = 3, and the 1-D points "3", "6", "8", "1" (CF entries written as (n, (LS, SS))):
  {(2,(9,45))}  →  {(2,(4,10)), (2,(14,100))}
For "2" inserted, (1,(2,4)):
  root {(3,(6,14)), (2,(14,100))};  leaves {(2,(3,5)), (1,(3,9))} and {(2,(14,100))}
For "5" inserted, (1,(5,25)):
  root {(3,(6,14)), (3,(19,125))};  leaves {(2,(3,5)), (1,(3,9))} and {(2,(11,61)), (1,(8,64))}
For "7" inserted, (1,(7,49)):
  root {(3,(6,14)), (4,(26,174))};  leaves {(2,(3,5)), (1,(3,9))} and {(2,(11,61)), (2,(15,113))}
Advanced Database Systems, mod1-3, 2006
66
Evaluation of BIRCH
Scales linearly: finds a good clustering with a
single scan and improves the quality with a
few additional scans
Weakness: handles only numeric data and is sensitive to the order of the data records.
Advanced Database Systems, mod1-3, 2006
67
Data Summarization
To compress the data into suitable representative objects
OPTICS; Data Bubbles
Finding clusters from a hierarchical clustering depends on the "resolution" at which the hierarchy is cut.
[Figure: a dendrogram over points A B C D E F G, with similarity on the vertical axis.]
Advanced Database Systems, mod1-3, 2006
68
OPTICS
M. Ankerst, M. Breunig, H. Kriegel, J. Sander, OPTICS: Ordering Points to Identify the Clustering
Structure, ACM SIGMOD, 1999.
 Pre: N(q): the subset of D contained in the –
neighborhood of q. ( is a radius)

Definition 1: (directly density-reachable) Object p is directly density-reachable from object q
wrt.  and MinPts in a set of objects D if 1) p  N,(q) (N(q) is the subset of D contained in
the -neighborhood of q.) 2) Card(N(q)) >= MinPts (Card(N) denotes the cardinality of the
set N)
 Definitions
 Directly density-reachable (p.51 & Figure 2)  density


reachable [transitivity of ddr]
Density-connected: (p -> o <- q)
Core-distance , MinPts (p): MinPts_distance (p)
Reachability-distance , MinPts (p,o) wrt o: max(core-distance(o),
dist(o,p))  Figure 4
 Ex) cluster ordering  reachability values; Fig 12
Advanced Database Systems, mod1-3, 2006
69
Data Bubbles
M. Breunig, H. Kriegel, P. Kroger, J. Sander, Data Bubbles: Quality Preserving Performance Boosting
for Hierarchical Clustering, ACM SIGMOD, 2001.
 ε-neighborhood of P:
   N_ε(P) = { X ∈ D | d(P, X) ≤ ε }
 k-distance of P: the distance d(P, O) such that for at least k objects O' ∈ D it holds that d(P, O') ≤ d(P, O), and for at most k-1 objects O' ∈ D it holds that d(P, O') < d(P, O); then k-dist(P) = d(P, O).
 k-nearest neighbors of P:
   N_{k-dist(P)}(P) = N_k(P)
 MinPts-dist(P): a distance within which there are at least MinPts objects in the neighborhood of P.
Advanced Database Systems, mod1-3, 2006
70
Data Bubbles
M. Breunig, H. Kriegel, P. Kroger, J. Sander, Data Bubbles: Quality Preserving Performance Boosting
for Hierarchical Clustering, ACM SIGMOD, 2001.
 Structural distortion
  Figure 11
 Data Bubble B = (n, rep, extent, nnDist)
  n: number of objects in the underlying set X; rep: a representative object for X; extent: an estimate of the radius of X; nnDist: a partial function estimating the k-nearest-neighbor distances within X.
 Distance (B, C) (page 6):
  dist(B.rep, C.rep) - [B.extent + C.extent] + [B.nnDist(1) + C.nnDist(1)]  if the bubbles do not overlap,
  and max( B.nnDist(1), C.nnDist(1) )  otherwise.
Advanced Database Systems, mod1-3, 2006
71
K-Means in SQL
C. Ordonez, Integrating K-Means Clustering with a Relational DBMS Using SQL, IEEE TKDE 2006.
 Dataset: Y = {y_1, y_2, …, y_n}, a d x n matrix, where each y_i is a d x 1 column vector
 K-means: find k clusters X_1, …, X_k by minimizing the squared error from the centers
  Squared distance: Eq. (1); objective function: Eq. (2)
  Model matrices
   • W: k weights (fractions of n)
   • C: k means (centroids), a d x k matrix
   • R: k variances (squared distances)
  Sufficient-statistics matrices
   • M_j: the d sums of point dimension values in cluster j
   • Q_j: the d sums of squared dimension values in cluster j
   • N_j: the number of points in cluster j, N_j = |X_j|
  Intermediate tables: YH, YV, YD, YNN, NMQ, WCR (see the figure in the paper)
 Update formulas:
   M_j = \sum_{y_i \in X_j} y_i
   Q_j = \sum_{y_i \in X_j} y_i y_i^T
   W_j = N_j / \sum_{j=1}^{k} N_j
   C_j = M_j / N_j
   R_j = Q_j / N_j - C_j C_j^T
Advanced Database Systems, mod1-3, 2006
72
[Worked example (figure): the horizontal input table Y(i, Y1, Y2, Y3), the pivoted vertical table YV(i, l, val), the centroid tables CH and C, the per-point distance table YD(i, d1, d2), the nearest-centroid table YNN(i, j), and the summary tables NMQ and WCR; the individual cell values shown in the figure are omitted here.]

Representative SQL (cleaned up from the slide):

Insert into C
Select 1, 1, Y1 From CH Where j = 1;
…
Insert into C
Select d, k, Yd From CH Where j = k;

Insert into YD
Select i, sum((YV.val - C.C1)**2) AS d1,
       …, sum((YV.val - C.Ck)**2) AS dk
From YV, C
Where YV.l = C.l Group by i;

Insert into YNN
Select i,
  CASE When d1 <= d2 AND … AND d1 <= dk Then 1
       When d2 <= d3 … Then 2
       ELSE k END
From YD;

Insert into NMQ
Select l, j, sum(1.0) AS N,
       sum(YV.val) AS M, sum(YV.val * YV.val) AS Q
From YV, YNN
Where YV.i = YNN.i
Group by l, j;
Advanced Database Systems, mod1-3, 2006
[Figure continued: the resulting YNN, YD, NMQ, and WCR tables for the example; NMQ holds, per dimension l and cluster j, the count N, sum M, and sum of squares Q, and WCR holds the derived weights W, centroids C, and variances R.]
73
Incremental Data Summarization
S. Nassar, J. Sander, C. Cheng, Incremental and Effective Data Summarization for Dynamic
Hierarchical Clustering, ACM SIGMOD, 2004.
For D = {X_i}, 1 ≤ i ≤ N, and a set B of data bubbles, the data index of a bubble holding n objects is ρ = n / N.
For D = {X_i} with mean X̄ and standard deviation σ_X, Chebyshev's inequality gives P( |X - X̄| ≥ k σ_X ) ≤ 1 / k^2.
A data bubble is
 "good" iff its size falls within the expected range [mean - deviation, mean + deviation],
 "under-filled" iff it falls below that range, and
 "over-filled" iff it falls above that range.
Advanced Database Systems, mod1-3, 2006
74
Research Issues
Dimensionality Reduction
Approximation
Advanced Database Systems, mod1-3, 2006
75
Advanced Database Systems, mod1-3, 2006
76
Guha, Rastogi & Shim, CURE: An Efficient Clustering Algorithm for Large Databases, ACM SIGMOD, 1998
Cure: The Algorithm
Draw a random sample s.
Partition the sample into p partitions, each of size s/p.
Partially cluster each partition into s/(pq) clusters.
Eliminate outliers
 By random sampling
 If a cluster grows too slowly, eliminate it.
Cluster the partial clusters.
Label the data on disk.
Advanced Database Systems, mod1-3, 2006
77
Guha, Rastogi & Shim, CURE: An Efficient Clustering Algorithm for Large Databases, ACM SIGMOD, 1998
Data Partitioning and Clustering
 s = 50
 p = 2
 s/p = 25
 s/(pq) = 5
[Figure: the sample points in the (x, y) plane, first split into two partitions of 25 points and then partially clustered into 5 clusters per partition.]
Advanced Database Systems, mod1-3, 2006
78
Guha, Rastogi & Shim, CURE: An Efficient Clustering Algorithm for Large Databases, ACM SIGMOD, 1998
Cure: Shrinking Representative Points
[Figure: a cluster in the (x, y) plane before and after its representative points are shrunk toward the centroid.]
 Shrink the multiple representative points towards the gravity center by a fraction α.
 Multiple representatives capture the shape of the cluster.
Advanced Database Systems, mod1-3, 2006
79
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion),
such as density-connected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al (SIGMOD’99).
 DENCLUE: Hinneburg & D. Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98)
Advanced Database Systems, mod1-3, 2006
80
CLIQUE (Clustering In QUEst)
Agrawal, Gehrke, Gunopulos, Raghavan, Automatic Subspace Clustering of
High Dimensional Data for Data Mining Applications, ACM SIGMOD
1998.
 Automatically identifying subspaces of a high
dimensional data space that allow better clustering than
original space
 CLIQUE can be considered as both density-based and
grid-based
 It partitions each dimension into the same number of equal-length intervals
 It thereby partitions the d-dimensional data space into non-overlapping rectangular units
 A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
 A cluster is a maximal set of connected dense units within a subspace
 A cluster is a maximal set of connected dense units within a
subspace
Advanced Database Systems, mod1-3, 2006
81
[Figure: CLIQUE example. One panel shows salary (x $10,000, 0-7) versus age (20-60), another shows vacation (weeks, 0-7) versus age; dense one- and two-dimensional units (with a density threshold of 3) are intersected to locate a candidate dense region in the (age, salary, vacation) subspace.]
Advanced Database Systems, mod1-3, 2006
82
Agrawal, Gehrke, Gunopulos, Raghavan, Automatic Subspace Clustering of High
Dimensional Data for Data Mining Applications, ACM SIGMOD 1998.
CLIQUE: The Major Steps
 Partition the data space and find the number of points that lie inside each cell of the partition.
 Identify the subspaces that contain clusters using the Apriori principle.
 Identify clusters:
  Determine dense units in all subspaces of interest
  Determine connected dense units in all subspaces of interest
 Generate a minimal description for the clusters:
  Determine the maximal regions that cover a cluster of connected dense units, for each cluster
  Determine a minimal cover for each cluster
Advanced Database Systems, mod1-3, 2006
83
Strength and Weakness of CLIQUE
Strength
 It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
 It is insensitive to the order of records in input and
does not presume some canonical data distribution
 It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
 The accuracy of the clustering result may be
degraded at the expense of simplicity of the method
Advanced Database Systems, mod1-3, 2006
84
Model based clustering
Assume the data are generated from K probability distributions, typically Gaussian.
This is a soft, or probabilistic, version of K-means clustering.
We need to find the distribution parameters.
EM Algorithm
Advanced Database Systems, mod1-3, 2006
85
EM Algorithm
 Initialize the K cluster centers
 Iterate between two steps
  Expectation step: assign points to clusters
    P(d_i ∈ c_k) = w_k Pr(d_i | c_k) / \sum_j w_j Pr(d_i | c_j)
    w_k = \sum_i P(d_i ∈ c_k) / N
  Maximization step: estimate the model parameters
    μ_k = \sum_{i=1}^{m} d_i P(d_i ∈ c_k) / \sum_{i=1}^{m} P(d_i ∈ c_k)
(An illustrative one-dimensional EM sketch follows this slide.)
Advanced Database Systems, mod1-3, 2006
86
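A minimal 1-D sketch of these two steps for a mixture of Gaussians with fixed, equal variance (illustrative only; w are the mixing weights, mu the component means):

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_step(data, w, mu, sigma=1.0):
    """One EM iteration for a K-component, equal-variance 1-D Gaussian mixture."""
    K, N = len(w), len(data)
    # E-step: responsibilities P(d_i in c_k) proportional to w_k * Pr(d_i | c_k).
    resp = []
    for x in data:
        scores = [w[k] * normal_pdf(x, mu[k], sigma) for k in range(K)]
        total = sum(scores)
        resp.append([s / total for s in scores])
    # M-step: update mixing weights and means from the responsibilities.
    new_w = [sum(resp[i][k] for i in range(N)) / N for k in range(K)]
    new_mu = [sum(resp[i][k] * data[i] for i in range(N)) /
              sum(resp[i][k] for i in range(N)) for k in range(K)]
    return new_w, new_mu

data = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9]
w, mu = [0.5, 0.5], [1.0, 4.0]
for _ in range(20):
    w, mu = em_step(data, w, mu)
print([round(m, 2) for m in mu])  # means converge near 0.1 and 5.0
```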
CURE (Clustering Using REpresentatives)
Guha, Rastogi & Shim, CURE: An Efficient Clustering Algorithm for
Large Databases, ACM SIGMOD, 1998
 Stops the creation of a cluster hierarchy if a level
consists of k clusters
 Uses multiple representative points to evaluate the
distance between clusters, adjusts well to arbitrary
shaped clusters and avoids single-link effect
Advanced Database Systems, mod1-3, 2006
87
Guha, Rastogi & Shim, CURE: An Efficient Clustering Algorithm for Large Databases, ACM SIGMOD, 1998
Drawbacks of Distance-Based Method
Drawbacks of square-error based clustering
method
 Consider only one point as representative of a cluster
 Good only for convex shaped, similar size and
density, and if k can be reasonably estimated
Advanced Database Systems, mod1-3, 2006
88
Zhang, Ramakrishnan, Livny, Birch: Balanced Iterative Reducing and Clustering using Hierarchies, ACM SIGMOD 1996.
BIRCH
 Dependent on order of insertions
 Works for convex, isotropic clusters of uniform
size
 Labeling Problem
 Centroid approach:
 Labeling Problem: even with correct centers, we
cannot label correctly
Advanced Database Systems, mod1-3, 2006
89
Jensen-Shannon Divergence
 Jensen-Shannon (JS) divergence between two probability distributions:
   JS_π(p_1, p_2) = π_1 KL(p_1, π_1 p_1 + π_2 p_2) + π_2 KL(p_2, π_1 p_1 + π_2 p_2)
                  = H(π_1 p_1 + π_2 p_2) - π_1 H(p_1) - π_2 H(p_2)
   where π_1, π_2 ≥ 0 and π_1 + π_2 = 1
 Jensen-Shannon (JS) divergence between a finite number of probability distributions:
   JS_π({p_1, …, p_n}) = \sum_i π_i KL(p_i, π_1 p_1 + … + π_n p_n)
                       = H( \sum_i π_i p_i ) - \sum_i π_i H(p_i)
Advanced Database Systems, mod1-3, 2006
90
Information-Theoretic Clustering:
(preserving mutual information)
 (Lemma) The loss in mutual information equals:
   I(X, Y) - I(X, Ŷ) = \sum_{j=1}^{k} π(ŷ_j) JS*( { p(x | y_t) : y_t ∈ ŷ_j } )
 Interpretation: the quality of each cluster is measured by the Jensen-Shannon divergence between the individual distributions in the cluster.
 Can rewrite the above as:
   I(X, Y) - I(X, Ŷ) = \sum_{j=1}^{k} \sum_{y_t ∈ ŷ_j} π_t KL( p(x | y_t), p(x | ŷ_j) )
 Goal: find a clustering that minimizes this loss
Advanced Database Systems, mod1-3, 2006
91
Information Theoretic Co-clustering
(preserving mutual information)
(Lemma) The loss in mutual information equals
   I(X, Y) - I(X̂, Ŷ) = KL( p(x, y) || q(x, y) )
                      = H(X̂, Ŷ) + H(X | X̂) + H(Y | Ŷ) - H(X, Y)
where
   q(x, y) = p(x̂, ŷ) p(x | x̂) p(y | ŷ),  for x ∈ x̂, y ∈ ŷ
 It can be shown that q(x, y) is a "maximum entropy" approximation to p(x, y).
 q(x, y) preserves the marginals: q(x) = p(x) and q(y) = p(y)
Advanced Database Systems, mod1-3, 2006
92
p ( x, y )
.05
.05
 00
.04
.04
.5
.5
0
 00
 0
0
0
.5
.5
0
0
.05 .05
0
0
.05 .05
0
0
0
0
0
0
.05 .05
.05 .05
.04
0
.04 .04
.04 .04

0
0

0

.5

.5
0
0
.04
.03 .03
.2 .2
p( xˆ, yˆ )

0

.05

.05

.04

.04
0

.36 .36 .28
0
0
.28 .36 .36
0
0
p ( y | yˆ )
p( x | xˆ )
0
0

.054
.054
 00
.036
.036
.054
.042
0
0
.054
.042
0
0
0
0
0
0
.042 .054
.042 .054
.036
028
.028 .036
.036
.028
.028 .036

0

.054

.054

.036

.036
0
q ( x, y )
#parameters that determine q are: (m-k)+(kl-1)+(n-l)
Advanced Database Systems, mod1-3, 2006
93
Preserving Mutual Information
 Lemma:
   KL( p(x, y) || q(x, y) ) = \sum_{x̂} \sum_{x ∈ x̂} p(x) KL( p(y | x) || q(y | x̂) )
 where
   q(y | x̂) = p(y | ŷ) p(ŷ | x̂)
 Note that q(y | x̂) may be thought of as the "prototype" of row cluster x̂ (the usual "centroid" of the cluster is \sum_{x ∈ x̂} p(y | x) p(x | x̂)).
 Similarly,
   KL( p(x, y) || q(x, y) ) = \sum_{ŷ} \sum_{y ∈ ŷ} p(y) KL( p(x | y) || q(x | ŷ) )
Advanced Database Systems, mod1-3, 2006
94
Example – Continued
[Figure: the row-cluster prototypes q(y | x̂) (rows such as .36 .36 .28 0 0 0 and .18 .18 .14 .14 .18 .18), the compressed joint p(x̂, ŷ), and the column-cluster prototypes q(x | ŷ) for the running example.]
Advanced Database Systems, mod1-3, 2006
95
Co-Clustering Algorithm
Advanced Database Systems, mod1-3, 2006
96
Properties of Co-clustering
Algorithm
Theorem: The co-clustering algorithm
monotonically decreases loss in mutual
information (objective function value)
Marginals p(x) and p(y) are preserved at every
step (q(x)=p(x) and q(y)=p(y) )
Can be generalized to higher dimensions
Advanced Database Systems, mod1-3, 2006
97
Advanced Database Systems, mod1-3, 2006
98
Applications -- Text Classification


Assigning class labels to text documents
Training and Testing Phases
[Figure: in the training phase a document collection is grouped into classes Class-1 … Class-m and a classifier learns from this training data; in the testing phase a new document is given to the classifier and comes out with an assigned class.]
Advanced Database Systems, mod1-3, 2006
99
Dimensionality Reduction
 Feature Selection
  A document is turned into a bag-of-words vector over words Word#1 … Word#k (for m documents):
  • Select the "best" words
  • Throw away the rest
  • Frequency-based pruning
  • Information-criterion-based pruning
 Feature Clustering
  A document is turned into a vector over word clusters Cluster#1 … Cluster#k:
  • Do not throw away words
  • Cluster words instead
  • Use the clusters as features
Advanced Database Systems, mod1-3, 2006
100
Experiments
 Data sets
  20 Newsgroups data
   • 20 classes, 20000 documents
  Classic3 data set
   • 3 classes (cisi, med, and cran), 3893 documents
  Dmoz Science HTML data
   • 49 leaves in the hierarchy
   • 5000 documents with 14538 words
   • Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
 Implementation details
  Bow – for indexing, co-clustering, clustering, and classifying
Advanced Database Systems, mod1-3, 2006
101
Naïve Bayes with word clusters
 Naïve Bayes classifier
  Assign document d to the class with the highest score
    c*(d) = argmax_i ( log p(c_i) + \sum_{t=1}^{v} p(w_t | d) log p(w_t | c_i) )
  Relation to KL divergence
    c*(d) = argmin_i ( KL( p(W | d), p(W | c_i) ) - log p(c_i) )
  Using word clusters instead of words
    c*(d) = argmax_i ( log p(c_i) + \sum_{s=1}^{k} p(x̂_s | d) log p(x̂_s | c_i) )
  where the parameters for clusters are estimated according to the joint statistics
Advanced Database Systems, mod1-3, 2006
102
T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama, Constructing Efficient Decision Trees
by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., 1996.
Selecting Correlated Attributes
Decide that A and A' are strongly correlated iff
  ( E(S) - E( S(R_opt); S(R̄_opt) ) ) / ( E(S) - min{ E( S(I); S(Ī) ), E( S(I'); S(Ī') ) } ) ≥ θ
for a threshold θ ≤ 1, where R_opt is the optimized two-dimensional region and I, I' are the optimized intervals on A and A' alone.
Advanced Database Systems, mod1-3, 2006
103
MDL-based Decision Tree Pruning
M. Mehta, J. Rissanen, R. Agrawal, MDL-based Decision Tree
Pruning, Proc. on KDD Conf., 1995.
 Two steps for the induction of decision trees
  1. Construct a DT using the training data
  2. Reduce the DT by pruning, to prevent "overfitting"
 Possible approaches
  Cost-complexity pruning: uses a separate set of samples for pruning
  DT pruning: uses the same training data set for testing
  MDL-based pruning: uses the Minimum Description Length (MDL) principle
Advanced Database Systems, mod1-3, 2006
104
M. Mehta, J. Rissanen, R. Agrawal, MDL-based
Decision Tree Pruning, Proc. on KDD Conf., 1995.
Pruning Using MDL Principle
 View decision tree as a means for efficiently encoding
classes of records in training set
 MDL Principle: best tree is the one that can encode
records using the fewest bits
 Cost of encoding tree includes
 1 bit for encoding type of each node (e.g. leaf or
internal)
 Csplit : cost of encoding attribute and value for each
split
 n*E: cost of encoding the n records in each leaf (E
is entropy)
Advanced Database Systems, mod1-3, 2006
105
M. Mehta, J. Rissanen, R. Agrawal, MDL-based
Decision Tree Pruning, Proc. on KDD Conf., 1995.
Pruning Using MDL Principle
 Problem: to compute the minimum cost subtree at
root of built tree
 Suppose minCN is the cost of encoding the minimum
cost subtree rooted at N
 Prune children of a node N if minCN = n*E+1
 Compute minCN as follows:
 N is leaf: n*E+1
 N has children N1 and N2:
min{n*E+1,Csplit+1+minCN1+minCN2}
 Prune tree in a bottom-up fashion
Advanced Database Systems, mod1-3, 2006
106
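A small, hypothetical Python sketch of the bottom-up minCN computation described above (1 bit per node type, n*E for the records in a leaf, Csplit for a split; the numbers in the example are chosen so that n*E + 1 = 3.8, matching the next slide, and are not from the paper):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    n_records: int
    entropy: float                  # E of the records reaching this node
    csplit: float = 0.0             # cost of encoding the split (internal nodes only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def min_cost(node: Node) -> float:
    """minCN: cost of the cheapest subtree rooted at node; prunes children when a leaf is cheaper."""
    leaf_cost = node.n_records * node.entropy + 1              # n*E + 1
    if node.left is None or node.right is None:
        return leaf_cost
    split_cost = node.csplit + 1 + min_cost(node.left) + min_cost(node.right)
    if leaf_cost <= split_cost:
        node.left = node.right = None                          # prune the children of N
        return leaf_cost
    return split_cost

# Hypothetical numbers mirroring the example: n*E + 1 = 3.8, Csplit = 2.6, two leaf children of cost 1.
root = Node(n_records=7, entropy=0.4, csplit=2.6,
            left=Node(n_records=1, entropy=0.0), right=Node(n_records=2, entropy=0.0))
print(round(min_cost(root), 2), root.left is None)  # 3.8 True
```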
R. Rastogi, K. Shim, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Proc. of VLDB Conf., 1998.
MDL Pruning - Example
[Figure: a subtree rooted at node N (split "education in graduate", with a "salary < 40k" split below), with children N1 and N2 each of cost 1; the encoding cost n*E + 1 at N is 3.8.]
• Cost of encoding the records in N: n*E + 1 = 3.8
• Csplit = 2.6
• minCN = min{3.8, 2.6 + 1 + 1 + 1} = 3.8
• Since minCN = n*E + 1, N1 and N2 are pruned
Advanced Database Systems, mod1-3, 2006
107
PUBLIC
R. Rastogi, K. Shim, PUBLIC: A Decision Tree Classifier that Integrates
Building and Pruning, Proc. of VLDB Conf., 1998.
 Prune tree during (not after) building phase
 Execute pruning algorithm (periodically) on partial tree
 Problem: how to compute minCN for a “yet to be
expanded” leaf N in a partial tree
 Solution: compute lower bound on the subtree cost at N
and use this as minCN when pruning
 minCN is thus a “lower bound” on the cost of subtree
rooted at N
 Prune children of a node N if minCN = n*E+1
 Guaranteed to generate identical tree to that generated
by SPRINT
Advanced Database Systems, mod1-3, 2006
108
R. Rastogi, K. Shim, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Proc. of VLDB Conf., 1998.
PUBLIC(1)
sal
education
Label
10K
High-school
Reject
40K
Under
Accept
15K
Under
Reject
75K
grad
Accept
18K
grad
Accept
education in graduate
N
5.8
no
N1 1
yes
N2
1
• Simple lower bound for a subtree: 1
• Cost of encoding records in N = n*E+1 = 5.8
• Csplit = 4
• minCN = min{5.8, 4+1+1+1} = 5.8
• Since minCN = n*E+1, N1 and N2 are pruned
Advanced Database Systems, mod1-3, 2006
109
PUBLIC(S)
Theorem: the cost of any subtree with s splits rooted at node N is at least
  2*s + 1 + s*log a + \sum_{i=s+2}^{k} n_i
 a is the number of attributes
 k is the number of classes
 n_i (≥ n_{i+1}) is the number of records belonging to class i
The lower bound on the subtree cost at N is thus the minimum of:
 n*E + 1 (cost with zero splits)
 2*s + 1 + s*log a + \sum_{i=s+2}^{k} n_i
Advanced Database Systems, mod1-3, 2006
110
What’s Clustering
 Clustering is a kind of unsupervised learning.
 Clustering is a method of grouping data that share similar trends and patterns.
 Clustering of data is a method by which a large set of data is grouped into clusters of smaller sets of similar data.
 [Figure: an example set of points before and after clustering.]
Thus, clustering means grouping data, or dividing a large data set into smaller data sets that share some similarity.
Advanced Database Systems, mod1-3, 2006
111
Partitional Algorithms
Enumerate K partitions optimizing some criterion
Example: the square-error criterion
  e^2 = \sum_{j=1}^{k} \sum_{i=1}^{n_j} || x_i^{(j)} - c_j ||^2
 where x_i^{(j)} is the ith pattern belonging to the jth cluster and c_j is the centroid of the jth cluster.
Advanced Database Systems, mod1-3, 2006
112
Squared Error Clustering Method
1. Select an initial partition of the patterns with
a fixed number of clusters and cluster
centers
2. Assign each pattern to its closest cluster
center and compute the new cluster centers
as the centroids of the clusters. Repeat this
step until convergence is achieved, i.e., until
the cluster membership is stable.
3. Merge and split clusters based on some
heuristic information, optionally repeating
step 2.
Advanced Database Systems, mod1-3, 2006
113
Agglomerative Clustering Algorithm
1. Place each pattern in its own cluster. Construct a list
of interpattern distances for all distinct unordered
pairs of patterns, and sort this list in ascending order
2. Step through the sorted list of distances, forming for
each distinct dissimilarity value dk a graph on the
patterns where pairs of patterns closer than dk are
connected by a graph edge. If all the patterns are
members of a connected graph, stop. Otherwise,
repeat this step.
3. The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level, forming a partition identified by the simply connected components in the corresponding graph.
Advanced Database Systems, mod1-3, 2006
114
Agglomerative Hierarchical Clustering
 Mostly used hierarchical clustering algorithm
 Initially each point is a distinct cluster
 Repeatedly merge closest clusters until the number of
clusters becomes K
 Closest: dmean (Ci, Cj) = m m
i
dmin (Ci, Cj) = pmin
, q
j
pq
Ci C j
Likewise dave (Ci, Cj) and dmax (Ci, Cj)
Advanced Database Systems, mod1-3, 2006
115
Clustering
Summary of Drawbacks of Traditional Methods
 Partitional algorithms split large clusters
 Centroid-based methods split large and non-hyperspherical clusters
  Centers of sub-clusters can be far apart
 The minimum spanning tree algorithm is sensitive to outliers and to slight changes in position
  Exhibits a chaining effect on strings of outliers
 Cannot scale up to large databases
Advanced Database Systems, mod1-3, 2006
116
Model-based Clustering
Mixture of Gaussians
 Gaussian pdf for component i: P(ω_i)
 Data points are drawn from N(μ_i, σ^2 I)
Consider
 Data points: x_1, x_2, …, x_N
 Mixing weights: P(ω_1), …, P(ω_k)
Likelihood function
  L = P(data | μ_i) = \prod_{i=1}^{N} P(ω_i) P(x_i | ω_i, μ_1, μ_2, …, μ_k)
 Maximize the likelihood function by solving
  \partial L / \partial μ_i = 0
Advanced Database Systems, mod1-3, 2006
117
Overview of EM Clustering
 Extensions and generalizations. The EM (expectation
maximization) algorithm extends the k-means clustering
technique in two important ways:
1. Instead of assigning cases or observations to clusters to
maximize the differences in means for continuous variables, the
EM clustering algorithm computes probabilities of
cluster memberships based on one or more
probability distributions. The goal of the clustering algorithm
then is to maximize the overall probability or likelihood of the data,
given the (final) clusters.
2. Unlike the classic implementation of k-means clustering, the
general EM algorithm can be applied to both
continuous and categorical variables (note that the classic
k-means algorithm can also be modified to accommodate categorical
variables).
Advanced Database Systems, mod1-3, 2006
118
EM Algorithm
 The EM algorithm for clustering is
described in detail in Witten and Frank
(2001).
 The basic approach and logic of this
clustering method is as follows.



Suppose you measure a single
continuous variable in a large sample
of observations.
Further, suppose that the sample
consists of two clusters of
observations with different means
(and perhaps different standard
deviations); within each sample, the
distribution of values for the
continuous variable follows the
normal distribution.
The resulting distribution of values
(in the population) may look like this:
Advanced Database Systems, mod1-3, 2006
119
EM v.s. k-Means
 Classification probabilities instead of
classifications. The results of EM clustering are
different from those computed by k-means clustering.
The latter will assign observations to clusters to
maximize the distances between clusters. The EM
algorithm does not compute actual assignments of
observations to clusters, but classification probabilities.
In other words, each observation belongs to each
cluster with a certain probability. Of course, as a final
result you can usually review an actual assignment of
observations to clusters, based on the (largest)
classification probability.
Advanced Database Systems, mod1-3, 2006
120
Finding k
 V-fold cross-validation. This type of cross-validation is useful when
no test sample is available and the learning sample is too small to
have the test sample taken from it. A specified V value for V-fold
cross-validation determines the number of random subsamples, as
equal in size as possible, that are formed from the learning sample.
The classification tree of the specified size is computed V times, each
time leaving out one of the subsamples from the computations, and
using that subsample as a test sample for cross-validation, so that
each subsample is used V - 1 times in the learning sample and just
once as the test sample. The CV costs computed for each of the V test
samples are then averaged to give the V-fold estimate of the CV costs.
Advanced Database Systems, mod1-3, 2006
121
L  Pdata | i    i Pi Px | i , 1 , 2 ,..., k 
N
i 1
Expectation Maximization
 A mixture of Gaussians
 Ex: x1=30, P(x1)=1/2; x2=18, P(x2)=u; x3=0, P(x3)=2u;
x4=23, P(x4)=1/2-3u

Likelihood for X1: a students; x2: b students; x3: c students; x4: d
students
a
d
1

b
c 1
L  Pa, b, c, d | i         2     3 
2
2

 To maximize L, calculate the log Likelihood L:
a
d
1
1

b
c




PL  log    log   log 2   log   3 
2
2

PL b 2c
3d
Supposing a=14, b=6,
 

0
c=9,d=10, then u=1/10.
  2  1  3
2
If x1+x2: h students  a+b=h
bc

 a=h/(u+1), b=2uh/(u+1)
6(b  c  d )
Advanced Database Systems, mod1-3, 2006
122
Gaussian (Normal) pdf
  p(x | μ, σ) = \frac{1}{σ \sqrt{2π}} e^{ -\frac{1}{2} ( \frac{x - μ}{σ} )^2 }
 The Gaussian function with mean (μ) and standard deviation (σ). Properties of the function:
 - Symmetric about the mean
 - Attains its maximum value at the mean and its minimum value at plus and minus infinity
 - The distribution is often referred to as "bell shaped"
 - At one standard deviation from the mean the function has dropped to about 2/3 of its maximum value; at two standard deviations it has fallen to about 1/7
 - The area under the function within one standard deviation of the mean is about 0.682, within two standard deviations it is 0.9545, and within three it is 0.9973; the total area under the curve is 1
Advanced Database Systems, mod1-3, 2006
123
Gaussian
  p(x | μ, σ) = \frac{1}{σ \sqrt{2π}} e^{ -\frac{1}{2} ( \frac{x - μ}{σ} )^2 }
Think of the cumulative distribution F_{μ,σ^2}(x):
  F_{μ,σ^2}(x) = \int_{-∞}^{x} p(z | μ, σ^2) dz
Inverse-transform sampling:
  ξ ~ Uniform(0, 1)  ⇒  x = F_{μ,σ^2}^{-1}(ξ) ~ p(x | μ, σ^2)
(A short sampling sketch follows this slide.)
Advanced Database Systems, mod1-3, 2006
124
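A tiny illustrative sketch of that inverse-transform idea using Python's standard library (statistics.NormalDist provides the inverse CDF; not code from the slides):

```python
import random
from statistics import NormalDist, fmean, stdev

def sample_gaussian(mu, sigma, n, seed=0):
    """Draw n samples from N(mu, sigma^2) by pushing Uniform(0,1) draws through the inverse CDF."""
    rng = random.Random(seed)
    inv_cdf = NormalDist(mu, sigma).inv_cdf
    # inv_cdf requires a value strictly between 0 and 1; guard against an exact 0.
    return [inv_cdf(rng.random() or 1e-12) for _ in range(n)]

samples = sample_gaussian(mu=5.0, sigma=2.0, n=10000)
print(round(fmean(samples), 2), round(stdev(samples), 2))  # close to 5.0 and 2.0
```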
Multi-variate Density Estimation
Mixture of Gaussians:
  p(x | θ) = \sum_{j=1}^{k} p_j p(x | μ_j, σ_j^2)
  θ = { p_1, …, p_k, μ_1, …, μ_k, σ_1^2, …, σ_k^2 }
θ contains all the parameters of the mixture model. The {p_j} are known as mixing proportions or coefficients.
A mixture-of-Gaussians model is a generic mixture:
  p(x | θ) = \sum_{j=1,2} P(y = j) p(x | y = j)
[Figure: a two-component mixture drawn as a small graphical model, with P(y) selecting y = 1 or y = 2 and then P(x | y = 1) or P(x | y = 2) generating x.]
Advanced Database Systems, mod1-3, 2006
125
Mixture Density
 If we are given just x, we do not know which mixture component this example came from:
   p(x | θ) = \sum_{j=1}^{k} p_j p(x | μ_j, σ_j^2)
 We can evaluate the posterior probability that an observed x was generated from the first mixture component:
   P(y = 1 | x, θ) = P(y = 1) p(x | y = 1) / \sum_{j=1,2} P(y = j) p(x | y = j)
                   = p_1 p(x | μ_1, σ_1^2) / \sum_{j=1,2} p_j p(x | μ_j, σ_j^2)
Advanced Database Systems, mod1-3, 2006
126