KDD 2004 Lectures

- March 4+9: Introduction to KDD
- March 11: Association Rule Mining
- March 23: Similarity Assessment
- March 25: Clustering and UHDM2
- March 30: Data Warehouses and OLAP
Clustering and
Similarity Assessment
©Jiawei Han and Micheline Kamber
with major Additions and Modifications by Ch. Eick
Organization for COSC 6340:
1. What is Clustering?
2. Object Similarity Assessment
3. K-means/medoid Clustering
4. Grid-based Clustering
5. Work at UH
Motivation: Why Clustering?
Problem: Identify (a small number of) groups of similar objects in a given (large) set of objects.
Goals:
- Find representatives for homogeneous groups → Data Compression
- Find "natural" clusters and describe their properties → "natural" Data Types
- Find suitable and useful groupings → "useful" Data Classes
- Find unusual data objects → Outlier Detection
Examples of Clustering Applications
- Plant/Animal Classification
- Book Ordering
- Cloth Sizes
- Fraud Detection (find outliers)
Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
Data Structures for Clustering

- Data matrix (n objects, p attributes):

  $$\begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}$$

- (Dis)similarity matrix (n x n):

  $$\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$$
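The step from the n x p data matrix to the n x n dissimilarity matrix is mechanical once a pairwise dissimilarity function d is fixed. The following minimal Python sketch (function and variable names are illustrative, not part of the lecture material) builds the matrix for an arbitrary d:

```python
import numpy as np

def dissimilarity_matrix(X, d):
    """Build the n x n dissimilarity matrix from an n x p data matrix X,
    using a pairwise dissimilarity function d(object_i, object_j)."""
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):                 # symmetric matrix with a zero diagonal
            D[i, j] = D[j, i] = d(X[i], X[j])
    return D

# Toy 4 x 2 data matrix, Manhattan distance as the pairwise dissimilarity
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
print(dissimilarity_matrix(X, lambda a, b: np.sum(np.abs(a - b))))
```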
Quality Evaluation of Clusters
- Dissimilarity/similarity metric: similarity is expressed in terms of a normalized distance function d, which is typically a metric; typically s(oi, oj) = 1 - d(oi, oj).
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of similarity functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio-scaled variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
Challenges in Obtaining Object Similarity Measures
- Many types of variables:
  - Interval-scaled variables
  - Binary variables and nominal variables
  - Ordinal variables
  - Ratio-scaled variables
- Objects are characterized by variables belonging to different types (mixture of variables)
Case Study: Patient Similarity
The following relation is given (with 10000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age)
Attribute domains:
- ssn: 9 digits
- weight: between 30 and 650; mweight=158, sweight=24.20
- height: between 0.30 and 2.20 meters; mheight=1.52, sheight=19.2
- cancer-sev: 4=serious, 3=quite_serious, 2=medium, 1=minor
- eye-color: {brown, blue, green, grey}
- age: between 3 and 100; mage=45, sage=13.2
Task: Define Patient Similarity
Generating a Global Similarity Measure
from Single Variable Similarity Measures
Assumption: A database may contain up to six
types of variables: symmetric binary, asymmetric
binary, nominal, ordinal, interval and ratio.
1. Standardize each variable, associate a (dis)similarity measure df with the standardized f-th variable, and determine a weight wf for the f-th variable.
2. Create the following global (dis)similarity measure d:

$$d(o_i, o_j) = \frac{\sum_{f=1}^{p} d_f(o_i, o_j)\, w_f}{\sum_{f=1}^{p} w_f}$$
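To make the weighted combination concrete, here is a small Python sketch (the helper names are made up for illustration); each per-variable function compares the f-th attribute values of the two objects and returns a value in [0, 1]:

```python
def global_dissimilarity(oi, oj, per_variable_d, weights):
    """d(oi, oj) = sum_f w_f * d_f(oi, oj) / sum_f w_f, with one
    dissimilarity function and one weight per variable."""
    num = sum(w * d(x, y) for d, w, x, y in zip(per_variable_d, weights, oi, oj))
    return num / sum(weights)

# One numeric and one categorical attribute
d_num = lambda a, b: abs(a - b)              # assumes values already normalized to [0, 1]
d_cat = lambda a, b: 0.0 if a == b else 1.0
print(global_dissimilarity([0.2, "blue"], [0.5, "brown"], [d_num, d_cat], [1.0, 0.2]))
# (1.0 * 0.3 + 0.2 * 1.0) / 1.2 = 0.4166...
```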
A Methodology to Obtain a Similarity Matrix
1. Understand variables
2. Remove (non-relevant and redundant) variables
3. (Standardize and) normalize variables (typically using z-scores, or variable values are transformed to numbers in [0,1])
4. Associate a (dis)similarity measure df with each variable
5. Associate a weight (measuring its importance) with each variable
6. Compute the (dis)similarity matrix
7. Apply a similarity-based data mining technique (e.g. clustering, nearest neighbor, multi-dimensional scaling, ...)
Interval-scaled Variables
- Standardize data using z-scores.
- Calculate the mean absolute deviation:

  $$s_f = \tfrac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \ldots + |x_{nf} - m_f|\right)$$

  where

  $$m_f = \tfrac{1}{n}\left(x_{1f} + x_{2f} + \ldots + x_{nf}\right)$$

- Calculate the standardized measurement (z-score):

  $$z_{if} = \frac{x_{if} - m_f}{s_f}$$

- Using the mean absolute deviation is more robust than using the standard deviation.
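As a quick illustration of the definitions above, the following sketch (illustrative, not from the lecture) standardizes one attribute using the mean absolute deviation:

```python
import numpy as np

def z_scores_mad(x):
    """Standardize a single attribute with the mean absolute deviation s_f
    instead of the standard deviation (more robust to outliers)."""
    x = np.asarray(x, dtype=float)
    m = x.mean()                       # m_f
    s = np.mean(np.abs(x - m))         # s_f, the mean absolute deviation
    return (x - m) / s

print(z_scores_mad([158, 172, 130, 650]))   # the outlier 650 inflates s_f less than it would the std
```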
Normalization in [0,1]
Problem: If non-normalized variables are used, the maximum distance between two values can be greater than 1.
Solution: Normalize interval-scaled variables using

$$z_{if} = \frac{x_{if} - \min_f}{(\max_f - \min_f) \cdot s}$$

where minf denotes the minimum value and maxf the maximum value of the f-th attribute in the data set, and s is a constant chosen depending on the similarity measure (e.g. if the Manhattan distance is used, s is chosen to be 1).
Other Normalizations
Goal: limit the maximum distance to 1.
- Start with a distance measure df(x,y).
- Determine the maximum distance dmaxf that can occur for two values of the f-th attribute (e.g. dmaxf = maxf - minf).
- Define df'(x,y) = 1 - df(x,y)/dmaxf.
Advantage: negative similarities cannot occur.
Similarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects.
- Some popular ones include the Minkowski distance:

  $$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \ldots + |x_{ip} - x_{jp}|^q}$$

  where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects, and q is a positive integer.
- If q = 1, d is the Manhattan distance:

  $$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \ldots + |x_{ip} - x_{jp}|$$
Similarity Between Objects (Cont.)
- If q = 2, d is the Euclidean distance:

  $$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \ldots + |x_{ip} - x_{jp}|^2}$$

- Properties:
  - d(i,j) ≥ 0
  - d(i,i) = 0
  - d(i,j) = d(j,i)
  - d(i,j) ≤ d(i,k) + d(k,j)
- One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
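A minimal implementation of the Minkowski family (an illustrative sketch, not part of the lecture material):

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional objects;
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

i, j = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(i, j, q=1))   # 7.0 (Manhattan)
print(minkowski(i, j, q=2))   # 5.0 (Euclidean)
```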
Similarity with respect to a Set of Binary Variables
- A contingency table for binary data (counts of value combinations over the p binary variables of objects i and j):

                 Object j
                 1        0        sum
  Object i  1    a        b        a+b
            0    c        d        c+d
            sum  a+c      b+d      p

- Jaccard coefficient (ignores agreements in 0's):

  $$d_{Jaccard}(i, j) = \frac{a}{a + b + c}$$

- Simple matching coefficient (considers agreements in 0's and 1's to be equivalent):

  $$d_{sym}(i, j) = \frac{a + d}{a + b + c + d}$$
Similarity between Binary Variable Sets
- Example:

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

- gender is a symmetric attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value N be set to 0

  $$d_{Jacc}(jack, mary) = \frac{2}{2 + 0 + 1} = 0.67$$
  $$d_{Jacc}(jack, jim) = \frac{1}{1 + 1 + 1} = 0.33$$
  $$d_{Jacc}(jim, mary) = \frac{1}{1 + 1 + 2} = 0.25$$
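The same numbers can be reproduced with a few lines of Python (a sketch; the asymmetric binary attributes are encoded as above and gender is excluded because it is symmetric):

```python
def jaccard(x, y):
    """Jaccard coefficient over asymmetric binary vectors: a / (a + b + c),
    where a counts matching 1's and b, c count the mismatches; matching 0's are ignored."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return a / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1 ... Test-4 with Y/P -> 1, N -> 0
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard(jack, mary), 2), round(jaccard(jack, jim), 2), round(jaccard(jim, mary), 2))
# 0.67 0.33 0.25
```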
Nominal Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g. red, yellow, blue, green
- Method 1: simple matching
  - m: # of matches, p: total # of variables

  $$d(o_i, o_j) = \frac{p - m}{p}$$

- Method 2: use a large number of binary variables
  - create a new binary variable for each of the M nominal states
Ordinal Variables
- An ordinal variable can be discrete or continuous
- Order is important (e.g. UH-grade, hotel-rating)
- Can be treated like interval-scaled variables:
  - replace xif by its rank rif ∈ {1, ..., Mf}
  - map the range of each variable onto [0,1] by replacing the f-th variable of the i-th object by

    $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

  - compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables
- Ratio-scaled variable: a positive measurement on a nonlinear, approximately exponential scale, such as Ae^(Bt) or Ae^(-Bt)
- Methods:
  - treat them like interval-scaled variables (not a good choice! why?)
  - apply a logarithmic transformation: yif = log(xif)
  - treat them as continuous ordinal data and treat their ranks as interval-scaled
Case Study --- Normalization
Patient(ssn, weight, height, cancer-sev, eye-color, age)
- Attribute relevance: ssn none; eye-color minor; all others major
- Attribute normalization:
  - ssn: remove!
  - weight: between 30 and 650; mweight=158, sweight=24.20; transform to zweight = (xweight - 158)/24.20 (alternatively, zweight = (xweight - 30)/620)
  - height: normalize like weight!
  - cancer_sev: 4=serious, 3=quite_serious, 2=medium, 1=minor; transform 4 to 1, 3 to 2/3, 2 to 1/3, 1 to 0, and then normalize like weight!
  - age: normalize like weight!
Case Study --- Weight Selection and Similarity Measure Selection
Patient(ssn, weight, height, cancer-sev, eye-color, age)
- For the normalized weight, height, cancer_sev and age values use a Manhattan-distance-based function, e.g.:
  dweight(w1,w2) = 1 - |(w1 - 158)/24.20 - (w2 - 158)/24.20|
- For eye-color use: deye-color(c1,c2) = if c1=c2 then 1 else 0
- Weight assignment: 0.2 for eye-color; 1 for all others
Final solution --- chosen similarity measure d:
Let o1=(s1,w1,h1,cs1,e1,a1) and o2=(s2,w2,h2,cs2,e2,a2)
d(o1,o2) := (dweight(w1,w2) + dheight(h1,h2) + dcancer-sev(cs1,cs2) + dage(a1,a2) + 0.2*deye-color(e1,e2)) / 4.2
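Putting the pieces together, a Python sketch of the chosen patient similarity measure might look as follows. It only illustrates the definition above with the constants from the case study; the cancer-severity term is simplified here by comparing the already mapped values in [0,1] directly rather than z-scoring them again:

```python
def patient_similarity(o1, o2):
    """d(o1, o2) from the case study; o = (ssn, weight, height, cancer_sev, eye_color, age),
    where cancer_sev is assumed to be already mapped to {0, 1/3, 2/3, 1}."""
    _, w1, h1, cs1, e1, a1 = o1          # ssn is ignored
    _, w2, h2, cs2, e2, a2 = o2

    def sim(x, y, m, s):                  # 1 - Manhattan distance of the z-scored values
        return 1 - abs((x - m) / s - (y - m) / s)

    d_weight = sim(w1, w2, 158, 24.20)
    d_height = sim(h1, h2, 1.52, 19.2)
    d_age    = sim(a1, a2, 45, 13.2)
    d_cancer = 1 - abs(cs1 - cs2)         # simplified: mapped severities compared directly
    d_eye    = 1.0 if e1 == e2 else 0.0

    return (d_weight + d_height + d_cancer + d_age + 0.2 * d_eye) / 4.2

o1 = ("111111111", 150, 1.60, 1.0, "blue", 40)
o2 = ("222222222", 165, 1.75, 2/3, "brown", 45)
print(round(patient_similarity(o1, o2), 3))
```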
Major Clustering Approaches
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen'67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e. mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no new assignments are made.
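A compact NumPy sketch of these four steps (illustrative; the initial partition is replaced by random seed points and ties are broken arbitrarily):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: assign each object to the nearest centroid, recompute
    the centroids, and repeat until the assignment no longer changes."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # step 1: initial seeds
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)                          # step 3: nearest seed point
        if np.array_equal(new_labels, labels):                    # step 4: stop when stable
            break
        labels = new_labels
        for c in range(k):                                        # step 2: recompute centroids
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0], [8.5, 9.5]])
print(k_means(X, k=2))
```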
The K-Means Clustering Method
- Example: [figure omitted: a sequence of 2-D scatter plots (axes 0-10) illustrating k-means iterations in which objects are reassigned to the nearest center and the cluster centers are recomputed until the assignment stabilizes]
Comments on the K-Means Method
- Strengths:
  - Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n.
  - Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weaknesses:
  - Applicable only when the mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes
PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
- Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily.
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih.
  3. For each pair of i and h: if TCih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object.
  4. Repeat steps 2-3 until there is no change.
PAM Clustering: Total swapping cost TCih = Σj Cjih
[figure omitted: four 2-D examples (axes 0-10) illustrating the contribution Cjih of an object j when medoid i is swapped with non-medoid h, with t denoting another current medoid:
Cjih = d(j, h) - d(j, i);  Cjih = 0;  Cjih = d(j, t) - d(j, i);  Cjih = d(j, h) - d(j, t)]
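The per-object costs Cjih above sum to the total swapping cost TCih, which can equivalently be computed as the change in each object's distance to its closest medoid. A small sketch (D is assumed to be a precomputed pairwise distance matrix; names are illustrative):

```python
def total_swapping_cost(D, medoids, i, h):
    """TC_ih: total change in cost if medoid i is replaced by non-medoid h.
    Each object j contributes C_jih, the change in its distance to the
    closest medoid before and after the swap."""
    new_medoids = [m for m in medoids if m != i] + [h]
    tc = 0.0
    for j in range(len(D)):
        old = min(D[j][m] for m in medoids)        # distance to closest current medoid
        new = min(D[j][m] for m in new_medoids)    # distance after swapping i for h
        tc += new - old                            # C_jih
    return tc

# PAM accepts the swap (i -> h) whenever total_swapping_cost(D, medoids, i, h) < 0.
```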
CLARANS ("Randomized" CLARA) (1994)
- CLARANS (A Clustering Algorithm based on RANdomized Search) (Ng and Han'94)
- CLARANS draws a sample of neighbors dynamically
- The clustering process can be viewed as searching a graph where every node is a potential solution, that is, a set of k medoids
- If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
- It is more efficient and scalable than both PAM and CLARA
- Focusing techniques and spatial access structures may further improve its performance (Ester et al.'95)
Grid-Based Clustering Method
- Uses a multi-resolution grid data structure
- Several interesting methods:
  - STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
  - WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using the wavelet method
  - CLIQUE: Agrawal, et al. (SIGMOD'98)
STING: A Statistical Information Grid Approach
- Wang, Yang and Muntz (VLDB'97)
- The spatial area is divided into rectangular cells
- There are several levels of cells corresponding to different levels of resolution
STING: A Statistical Information Grid Approach (2)
- Each cell at a high level is partitioned into a number of smaller cells in the next lower level
- Statistical info of each cell is calculated and stored beforehand and is used to answer queries
- Parameters of higher-level cells can easily be calculated from the parameters of lower-level cells:
  - count, mean, s, min, max
  - type of distribution: normal, uniform, etc.
- Use a top-down approach to answer spatial data queries
- Start from a pre-selected layer, typically with a small number of cells
- For each cell in the current level, compute the confidence interval
STING: A Statistical Information Grid Approach (3)
- Remove the irrelevant cells from further consideration
- When finished examining the current layer, proceed to the next lower level
- Repeat this process until the bottom layer is reached
- Advantages:
  - Query-independent, easy to parallelize, incremental update
  - O(K), where K is the number of grid cells at the lowest level
- Disadvantages:
  - All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
CLIQUE (Clustering In QUEst)
- Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
- Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
- CLIQUE can be considered both density-based and grid-based:
  - It partitions each dimension into the same number of equal-length intervals
  - It partitions an m-dimensional data space into non-overlapping rectangular units
  - A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
  - A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
1. Partition the data space and find the number of points that lie inside each cell of the partition (see the sketch below).
2. Identify the subspaces that contain clusters using the Apriori principle.
3. Identify clusters:
   - Determine dense units in all subspaces of interest.
   - Determine connected dense units in all subspaces of interest.
4. Generate a minimal description for the clusters:
   - Determine maximal regions that cover a cluster of connected dense units for each cluster.
   - Determine a minimal cover for each cluster.
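As a sketch of step 1 and the density test (illustrative assumptions: equal-length intervals per dimension and a density threshold tau given as a fraction of all points), the 1-D dense units from which higher-dimensional candidates are later built could be found as follows:

```python
import numpy as np
from collections import Counter

def dense_units_1d(X, n_intervals=10, tau=0.05):
    """Partition every dimension into equal-length intervals, count the points
    per unit, and keep the units whose fraction of points exceeds tau."""
    n, m = X.shape
    dense = {}
    for dim in range(m):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        width = (hi - lo) / n_intervals or 1.0              # guard against a constant dimension
        cells = np.minimum(((X[:, dim] - lo) / width).astype(int), n_intervals - 1)
        counts = Counter(cells.tolist())
        dense[dim] = {c for c, cnt in counts.items() if cnt / n > tau}
    return dense   # candidate k-dimensional units are formed from these via the Apriori principle

X = np.random.rand(500, 3)
print(dense_units_1d(X))
```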
[figure omitted: CLIQUE example showing dense regions in two 2-D subspaces and their combination; axis labels in the original: Salary (10,000) (0-7), Vacation (week) (0-7), age (20-60), with the slice Vacation = 3 marked]
Strength and Weakness of CLIQUE
- Strengths:
  - It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  - It is insensitive to the order of records in the input and does not presume any canonical data distribution
  - It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
- Weakness:
  - The accuracy of the clustering result may be degraded for the sake of the simplicity of the method
Work at UH related to Similarity Assessment and Clustering
- Creating environments for database clustering; problems related to multi-relational data mining [ER04]
- Distance function learning [EVR03]
- Supervised clustering [EZZ04]
- Using clustering to enhance classifiers [ICDM03], [ECAI04], [PKDD04] (not discussed)
- Using SQL queries for data summarization [KDD96], [RYU98] (not discussed)
Work at UH
CAL-FULL/UH Database Clustering
Similarity Assessment Environments
[architecture diagram omitted; components shown: a Clustering Tool with a User Interface, a Library of clustering algorithms, a Learning Tool with training data, a Similarity Measure Tool backed by a Library of similarity measures, default choices and domain information, and type and weight information, and a Data Extraction Tool over the DBMS providing an object view; outputs: a similarity measure and a set of clusters]
Work at UH
Prototypes of Similarity Assessment Tools
- Prototype 1 (CAL State Fullerton): supported the interactive definition of similarity measures; the knowledge representation format does not rely on modular units; provides a nearest-neighbor clustering algorithm for database clustering; functions were supported outside a DBMS.
- Prototype 2 (UH 2002): similarity measures are defined using a special language (not interactively); the tool supports modular units, and functions are provided using a Java/SQL-Server 2000 framework; functions were partially moved inside a DBMS (although some are still inside Java); analysis results are stored in the database and are therefore available for further analysis.
- Prototype 3 (UH): learn distance functions for classification problems. Currently under investigation!
Work at UH
Work at UH
Objective of Supervised Clustering: maximize cluster purity while keeping the number of clusters low.
Research Goals for Supervised Clustering:
- Develop representative-based supervised clustering algorithms.
- Show the benefits of supervised clustering in case studies that center on summary generation, distance function learning, and classification.
Work at UH
Work at UH
What is a good object distance function q
for supervised similarity assessment?
$$q(o_i, o_j) = \frac{\sum_{f=1}^{p} q_f(o_i, o_j)\, w_f}{\sum_{f=1}^{p} w_f}$$

- Objective: learn good distance functions for classification tasks.
- Our approach: apply a clustering algorithm with the distance function q to be evaluated, which returns a predetermined number of clusters k. The purer the obtained clusters are, the better the quality of q.
- Our goal is to learn the weights of an object distance function q such that all clusters are pure (or as pure as possible); for more details see the [ERV03] paper.
Work at UH
Idea: Coevolving Clusters and Similarity
Functions
[diagram omitted: a reinforcement-learning loop in which the current similarity function is used by the clustering algorithm to produce clusters X, the clustering is evaluated with q(X), and the resulting goodness of the similarity function drives the next adjustment of that function]

q(X) := percentage_of_minority_examples + penalty(k)
penalty(k) := if k <= c then 0 else sqrt((k - c)/n)
where
  k := number of clusters generated
  n := number of objects in the dataset
  c := number of classes in the dataset
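A direct transcription of this evaluation function into Python (a sketch; cluster_labels and class_labels are parallel lists with one entry per object, and the minority share is computed as a fraction in [0,1]):

```python
import math

def clustering_fitness(cluster_labels, class_labels, n_classes):
    """q(X) = percentage_of_minority_examples + penalty(k); lower is better.
    A minority example is one whose class is not the majority class of its cluster."""
    n = len(class_labels)
    clusters = {}
    for cl, y in zip(cluster_labels, class_labels):
        clusters.setdefault(cl, []).append(y)
    minority = sum(len(ys) - max(ys.count(c) for c in set(ys)) for ys in clusters.values())
    k = len(clusters)
    penalty = 0.0 if k <= n_classes else math.sqrt((k - n_classes) / n)
    return minority / n + penalty

print(clustering_fitness([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"], n_classes=2))   # 0.2
```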
Idea: CR*-Approach
Let Y be a clustering algorithm and Error(q,O) = Error'(Y(q,O)) an error function that measures class purity in clusters and class coverage, and assigns a penalty for large numbers of clusters.
While not done do:
1. Cluster with respect to (q,O), obtaining clusters C, and report Error'(Y(q,O)).
2. If Error'(Y(q,O)) is small enough, stop, reporting the error, C, and q.
3. For each cluster, determine the majority class.
4. For each c ∈ C, adjust the weights wj locally.
[figure omitted: two cluster sketches in which x marks examples of the majority class and o marks non-majority-class examples; the idea is to move examples of the majority class closer to each other, decreasing the weight of a modular unit in one case and increasing it in the other]
Work at UH
Weight Adjustment within a Cluster
Let wi be the current weight of the i-th modular unit.
Let si be the average absolute deviation, with respect to fi, of the examples that belong to the cluster.
Let mi be the average absolute deviation, with respect to fi, of the examples of the cluster that belong to the majority class.
Learning: weights are then adjusted as follows with respect to a particular cluster:
  wi' = wi + (si - mi)*a, or better
  wi' = wi + wi*min(max(-b, (si - mi)*a), b)
with a being the learning rate and b the maximal adjustment per weight per cluster (e.g. if b = 0.2, a weight can be increased or decreased by at most 20%).
Remark: if the cluster is 'pure' or does not contain 2 or more elements of a particular class, no weight adjustment takes place.
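The clamped update rule can be written as a one-line helper (a sketch with illustrative default values for the learning rate a and the cap b):

```python
def adjust_weight(w_i, s_i, m_i, a=0.3, b=0.2):
    """w_i' = w_i + w_i * clamp((s_i - m_i) * a, -b, +b): one cluster can change
    a weight by at most a factor of (1 +/- b)."""
    delta = min(max(-b, (s_i - m_i) * a), b)
    return w_i + w_i * delta

# If the majority class is tighter than the cluster as a whole (s_i > m_i),
# the unit's weight is increased; otherwise it is decreased.
print(adjust_weight(1.0, s_i=0.8, m_i=0.3))   # 1.15
```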
Summary: Problems and Challenges for Clustering
- Considerable progress has been made in scalable clustering methods:
  - Partitioning: k-means, k-medoids, CLARANS, EM
  - Hierarchical: BIRCH, CURE
  - Density-based: DBSCAN, CLIQUE, OPTICS
  - Grid-based: STING, WaveCluster
  - Model-based: Autoclass, Denclue, Cobweb
- Current clustering techniques do not address all the requirements adequately
- Constraint-based clustering analysis: constraints exist in data space (bridges and highways) or in user queries
Summary: Object Similarity & Clustering
- Cluster analysis groups objects based on their similarity and has wide applications.
- Appropriate similarity measures have to be chosen for the various types of variables and combined into a global similarity measure.
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
- Methods to measure, compute, and learn object similarity are quite important, not only for clustering, but also for nearest-neighbor approaches, information retrieval in general, and data visualization.
References (1)
- R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
- M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
- M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
- P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
- M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
- D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
- D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
- S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
- A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2)
- L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
- E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
- G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
- P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
- R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
- E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
- G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
- W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid approach to spatial data mining. VLDB'97.
- T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.