```
6. Cluster Analysis (6 hrs)
6.1 What is Cluster Analysis?
6.2 Types of Data in Cluster Analysis
6.3 A Categorization of Major Clustering Methods
6.4 Partitioning Methods
6.5 Grid-Based Methods
6.6 Model-Based Methods
6.7 Clustering High-Dimensional Data
6.8 Outlier Analysis
6.9 Summary
Key Points: Clustering, Partitioning, Hierarchical Methods, Outlier Analysis
Q&A:
1. Briefly outline how to compute the dissimilarity between objects
described by categorical variables.
A categorical variable is a generalization of the binary variable in that it can take
on more than two states. The dissimilarity between two objects i and j can be computed
based on the ratio of mismatches: d(i,j) = (p - m)/p, where m is the number of matches
(i.e., the number of variables for which i and j are in the same state), and p is the
total number of variables.
Alternatively, we can use a large number of binary variables by creating a new binary
variable for each of the M nominal states. For an object with a given state value, the
binary variable representing that state is set to 1, while the remaining binary
variables are set to 0.
2. Briefly outline how to compute the dissimilarity between objects
described by ratio-scaled variables.
Three methods include:
• Treat ratio-scaled variables as interval-scaled variables, so that the Minkowski,
Manhattan, or Euclidean distance can be used to compute the dissimilarity.
• Apply a logarithmic transformation to a ratio-scaled variable f having value x_if
for object i by using the formula y_if = log(x_if). The y_if values can be treated as
interval-valued.
• Treat x_if as continuous ordinal data, and treat their ranks as interval-scaled
variables.
3. Given the following measurements for the variable age:
18, 22, 25, 42, 28, 43, 33, 35, 56, 28.
Standardize the variable by the following:
(a) Compute the mean absolute deviation of age.
(b) Compute the z-score for the first four measurements.
(a) Compute the mean absolute deviation of age.
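The mean absolute deviation of age is 8.8, which is derived as follows.
m_age = (18 + 22 + 25 + 42 + 28 + 43 + 33 + 35 + 56 + 28) / 10 = 33
s_age = (|18 - 33| + |22 - 33| + ... + |28 - 33|) / 10 = 88 / 10 = 8.8
(b) Compute the z-score for the first four measurements.
According to the z-score computation formula, z_if = (x_if - m_f) / s_f, where s_f is
the mean absolute deviation computed in (a):
z(18) = (18 - 33) / 8.8 ≈ -1.70
z(22) = (22 - 33) / 8.8 = -1.25
z(25) = (25 - 33) / 8.8 ≈ -0.91
z(42) = (42 - 33) / 8.8 ≈ 1.02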
4. Given two objects represented by the tuples (22, 1, 42, 10) and
(20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
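(c) Compute the Minkowski distance between the two objects, using p = 3.
The attribute-wise absolute differences are 2, 1, 6, and 2, so:
(a) Euclidean distance: sqrt(2^2 + 1^2 + 6^2 + 2^2) = sqrt(45) ≈ 6.71
(b) Manhattan distance: 2 + 1 + 6 + 2 = 11
(c) Minkowski distance (p = 3): (2^3 + 1^3 + 6^3 + 2^3)^(1/3) = 233^(1/3) ≈ 6.15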
5. Briefly describe the concepts of clustering and list several
approaches to clustering.
Clustering is the process of grouping data into classes, or clusters, so that objects
within a cluster have high similarity in comparison to one another, but are very
dissimilar to objects in other clusters.
There are several approaches to clustering: partitioning methods, hierarchical methods,
density-based methods, grid-based methods, model-based methods, and constraint-based
methods.
6. What is the main concept of model-based methods for clustering?
Model-based methods: This approach hypothesizes a model for each of the clusters and
finds the best fit of the data to the given model. A model-based algorithm may locate
clusters by constructing a density function that reflects the spatial distribution of
the data points. It also leads to a way of automatically determining the number of
clusters based on standard statistics. It takes “noise” or outliers into account,
therein contributing to the robustness of the approach. COBWEB and self-organizing
feature maps are examples of model-based clustering.
7. Suppose that the data mining task is to cluster the following eight
points (with (x, y) representing location) into three clusters.
A1 (2, 10), A2 (2, 5), A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C1 (1, 2), C2 (4, 9).
The distance function is Euclidean distance. Suppose initially we assign A1, B1, and
C1 as the center of each cluster, respectively. Use the k-means algorithm to show only
(a) The three cluster centers after the first round of execution and
(b) The final three clusters
(a) After the first round, the three new clusters are:
(1) {A1}, (2) {B1, A3, B2, B3, C2}, (3) {C1, A2}, and their centers
are (1) (2, 10), (2) (6, 6), (3) (1.5, 3.5).
(b) The final three clusters are:
(1) {A1, C2, B1}, (2) {A3, B2, B3}, (3) {C1, A2}.
8. Both k-means and k-medoids algorithms can perform effective clustering.
Illustrate the strength and weakness of k-means in comparison with the
k-medoids algorithm. Also, illustrate the strength and weakness of these
schemes in comparison with a hierarchical clustering scheme (such as AGNES).
(a) Illustrate the strength and weakness of k-means in comparison with the
k-medoids algorithm.
The k-medoids algorithm is more robust than k-means in the presence of noise and
outliers, because a medoid is less influenced by outliers or other extreme values than
a mean. However, its processing is more costly than the k-means method.
(b) Illustrate the strength and weakness of these schemes in comparison with a
hierarchical clustering scheme (such as AGNES).
Both k-means and k-medoids perform partitioning-based clustering. An advantage
of such partitioning approaches is that they can undo previous clustering steps (by
iterative relocation), unlike hierarchical methods, which cannot make adjustments
once a split or merge has been executed. This weakness of hierarchical methods can
cause the quality of their resulting clustering to suffer. Partitioning-based
9. Clustering has been popularly recognized as an important data mining task. Give
an application example that takes clustering as a major data mining function.
An example that takes clustering as a major data mining function could be a system
that identifies groups of houses in a city according to house type, value, and
geographical location. More specifically, a clustering algorithm like CLARANS can be
used to discover that, say, the most expensive housing units in Vancouver can be
grouped into just a few clusters.
10. Why is outlier mining important? Briefly describe the different
approaches behind statistical-based outlier detection.
Data objects that are grossly different from, or inconsistent with, the remaining set
of data are called “outliers”. Outlier mining is useful for detecting fraudulent
activity (such as credit card or telecom fraud), as well as customer segmentation and
medical analysis. Computer-based outlier analysis may be statistical-based,
distance-based, or deviation-based.
The statistical-based approach assumes a distribution or probability model for the
given data set and then identifies outliers with respect to the model using a
discordancy test. The discordancy test is based on data distribution, distribution
parameters (e.g., mean, variance), and the number of expected outliers. The drawbacks
of this method are that most tests are for single attributes, and in many cases, the
data distribution may not be known.
```
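For question 1, the ratio-of-mismatches formula d(i,j) = (p - m)/p and the alternative binary encoding can be sketched in Python. This is only an illustration: the function names and the example house tuples are assumptions made for this sketch, not part of the original notes.

```python
def categorical_dissimilarity(obj_i, obj_j):
    """Ratio of mismatches d(i, j) = (p - m) / p for two equal-length tuples."""
    if len(obj_i) != len(obj_j) or len(obj_i) == 0:
        raise ValueError("objects must have the same, non-zero number of variables")
    p = len(obj_i)                                      # total number of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Encode one categorical value as binary variables, one per nominal state."""
    return [1 if value == state else 0 for state in states]


# Hypothetical objects described by three categorical variables each.
house_a = ("detached", "brick", "north")
house_b = ("detached", "wood", "north")

print(categorical_dissimilarity(house_a, house_b))  # one mismatch out of three
print(one_hot("wood", ["brick", "wood", "stone"]))  # [0, 1, 0]
```

Once every categorical variable is one-hot encoded, the objects can be compared with the usual measures for binary variables.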
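For question 2, the logarithmic transformation and the rank-based treatment of ratio-scaled values might look like the following sketch. The helper names and the sample values are assumptions chosen for illustration; the natural logarithm is used, although any base would serve the same purpose.

```python
import math


def log_transform(values):
    """Apply y_if = log(x_if); ratio-scaled values are assumed strictly positive."""
    return [math.log(x) for x in values]


def ranks(values):
    """Rank 1 for the smallest distinct value, 2 for the next; ties share a rank."""
    order = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [order[v] for v in values]


# Hypothetical measurements that grow roughly exponentially.
growth = [1, 10, 100, 1000]

y = log_transform(growth)
# After the transformation the values are evenly spaced, so interval-scaled
# distances (Euclidean, Manhattan, Minkowski) become meaningful.
print([round(v, 3) for v in y])
print(ranks([10, 1000, 1, 100]))  # [2, 4, 1, 3]
```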
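The standardization in question 3 can be checked with a short Python sketch. It assumes, as the question implies, that the z-score divides by the mean absolute deviation rather than the standard deviation; the function names are illustrative.

```python
ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]


def mean_absolute_deviation(values):
    """Return (mean, mean absolute deviation) of a list of numbers."""
    mean = sum(values) / len(values)
    mad = sum(abs(x - mean) for x in values) / len(values)
    return mean, mad


def z_scores(values):
    """Standardize each value as (x - mean) / mean absolute deviation."""
    mean, mad = mean_absolute_deviation(values)
    return [(x - mean) / mad for x in values]


mean_age, mad_age = mean_absolute_deviation(ages)
print(mean_age, mad_age)                           # 33.0 8.8
print([round(z, 2) for z in z_scores(ages)[:4]])   # [-1.7, -1.25, -0.91, 1.02]
```

The mean absolute deviation is less sensitive to outliers than the standard deviation, because the deviations from the mean are not squared.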
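The three distances asked for in question 4 are all special cases of the Minkowski distance, so one function covers them. The sketch below is an illustration with an assumed function name, not a reference implementation.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


obj1 = (22, 1, 42, 10)
obj2 = (20, 0, 36, 8)

euclidean = minkowski_distance(obj1, obj2, 2)   # sqrt(45)
manhattan = minkowski_distance(obj1, obj2, 1)   # 2 + 1 + 6 + 2
minkowski3 = minkowski_distance(obj1, obj2, 3)  # 233 ** (1/3)

print(round(euclidean, 2), manhattan, round(minkowski3, 2))
```

Because the attribute-wise differences are 2, 1, 6, and 2, the largest difference dominates more strongly as p grows.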
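The k-means rounds in question 7 can be traced with the sketch below. It is a minimal illustration rather than a general-purpose implementation: the helper names are assumptions, ties go to the lowest-numbered cluster, and empty clusters are not handled (none occur for this data).

```python
from math import dist  # available in Python 3.8+

POINTS = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}


def assign(points, centers):
    """Put each point name into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for name, xy in points.items():
        nearest = min(range(len(centers)), key=lambda k: dist(xy, centers[k]))
        clusters[nearest].append(name)
    return clusters


def recompute_centers(points, clusters):
    """Return the mean (x, y) of each cluster; assumes no cluster is empty."""
    centers = []
    for members in clusters:
        xs = [points[name][0] for name in members]
        ys = [points[name][1] for name in members]
        centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centers


def k_means(points, initial_centers, max_rounds=100):
    """Run k-means and return a list of (clusters, centers), one entry per round."""
    history, clusters, centers = [], None, list(initial_centers)
    for _ in range(max_rounds):
        new_clusters = assign(points, centers)
        if new_clusters == clusters:  # assignments stopped changing
            break
        clusters = new_clusters
        centers = recompute_centers(points, clusters)
        history.append((clusters, centers))
    return history


rounds = k_means(POINTS, [POINTS["A1"], POINTS["B1"], POINTS["C1"]])
print("after round 1:", rounds[0])
print("final clusters:", rounds[-1][0])
```

The first entry of `rounds` reproduces the centers (2, 10), (6, 6), and (1.5, 3.5), and the last entry gives the final clusters {A1, B1, C2}, {A3, B2, B3}, and {A2, C1}.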
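The robustness argument in question 8(a) can be seen on a one-dimensional toy cluster. The data and function names below are assumptions made for illustration. The medoid is taken to be the member with the smallest total distance to all other members, and finding it requires comparing every pair of points, which is why it costs more than a mean.

```python
def mean(values):
    """Arithmetic mean: a single pass over the data."""
    return sum(values) / len(values)


def medoid(values):
    """Member with the smallest total distance to all members (pairwise comparison)."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


cluster = [1, 2, 3, 4, 5]       # hypothetical 1-D cluster
with_outlier = cluster + [100]  # the same cluster plus one extreme value

print(mean(cluster), medoid(cluster))            # 3.0 3
print(mean(with_outlier), medoid(with_outlier))  # mean jumps above 19, medoid stays 3
```

In the second case the values 3 and 4 tie for the smallest total distance; `min` returns the first one, so the medoid remains an actual member near the bulk of the data while the mean is dragged toward the outlier.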
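The statistical-based approach in question 10 can be illustrated with a deliberately simple stand-in for a formal discordancy test. The sketch assumes the data follow a normal distribution and flags values lying more than a chosen number of standard deviations from the mean. The threshold of 2.0, the function name, and the reuse of the age data from question 3 are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean.

    Assumes a single attribute that is roughly normally distributed.
    """
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []            # all values identical: nothing can be flagged
    return [x for x in values if abs(x - center) / spread > threshold]


ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]
print(flag_outliers(ages))  # [56]
```

The sketch also shows the drawbacks named in the answer: it works on one attribute at a time, and its result is only meaningful if the assumed distribution actually fits the data.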