Transcript
Prof. Enza Messina
Lesson 1: Classification and Clustering

Objects characterized by one or more features

[Figure: objects plotted in a two-dimensional feature space; axes: Feature X, Feature Y]

Classification
◦ Have labels for some points
◦ Want a “rule” that will accurately assign labels to new points
◦ Supervised learning

Clustering
◦ No labels
◦ Group points into clusters based on how “near” they are to one another
◦ Identify structure in data
◦ Unsupervised learning
Classification
Given a set of data vectors $X = \{x_1, \dots, x_N\}$, $x_i \in \mathbb{R}^d$, where d is the input space dimensionality (number of features), they are mapped to a set of class labels, represented as $\{\omega_1, \dots, \omega_C\}$, where C is the total number of classes.
This mapping is modeled in terms of a mathematical function $y = f(x, w)$, where w is a vector of adjustable parameters.
These parameters are determined (optimized) by a learning algorithm, based on a dataset of input-output examples.
In clustering, labeled data is unavailable!
• Clustering is a subjective process
• “In cluster analysis a group of objects is split up into a number
of more or less homogeneous subgroups on the basis of an
often subjectively chosen measure of similarity (i.e., chosen
subjectively based on its ability to create “interesting” clusters),
such that the similarity between objects within a subgroup is
larger than the similarity between objects belonging to different
subgroups” (Backer and Jain, 1981)
• A different clustering criterion or clustering algorithm, or even the same algorithm with a different selection of parameters, may produce completely different clustering results
• “A cluster is a set of entities which are alike, and entities from
different clusters are not alike.”
• “A cluster is an aggregate of points in the test space such that
the distance between any two points in the cluster is less than
the distance between any point in the cluster and any point not
in it.”
• “Clusters may be described as continuous regions of this space
( d-dimensional feature space) containing a relatively high
density of points, separated from other such regions by regions
containing a relatively low density of points. “
Given a set of input patterns $X = \{x_1, \dots, x_N\}$, where $x_j = (x_{j1}, x_{j2}, \dots, x_{jd})^T \in \mathbb{R}^d$, with each measure $x_{ji}$ called a feature (also attribute, dimension or variable).
Hard partitional clustering attempts to seek a K-partition of X, $C = \{C_1, \dots, C_K\}$ ($K \le N$), such that:
• $C_i \ne \emptyset$, $i = 1, \dots, K$
• $\bigcup_{i=1}^{K} C_i = X$
• $C_i \cap C_j = \emptyset$, for $i, j = 1, \dots, K$ and $i \ne j$
It may also be possible that an object is allowed to belong to all K clusters with a degree of membership $u_{ij} \in [0, 1]$, which represents the membership coefficient of the jth object in the ith cluster and satisfies the following constraints:
$\sum_{i=1}^{K} u_{ij} = 1$ for every object j, and $\sum_{j=1}^{N} u_{ij} < N$ for every cluster i.
This is known as fuzzy clustering.
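As an illustration of these constraints, here is a minimal NumPy sketch (my own, not part of the lecture; the helper names are made up) that builds a membership matrix from a hard labelling and checks the conditions above:

```python
import numpy as np

def hard_membership(labels: np.ndarray, K: int) -> np.ndarray:
    """Build a K x N membership matrix U from hard cluster labels in {0, ..., K-1}."""
    N = labels.shape[0]
    U = np.zeros((K, N))
    U[labels, np.arange(N)] = 1.0      # each object belongs to exactly one cluster
    return U

def is_valid_membership(U: np.ndarray, fuzzy: bool = True) -> bool:
    """Check the partition constraints stated above."""
    K, N = U.shape
    in_range = np.all((U >= 0.0) & (U <= 1.0))               # u_ij in [0, 1]
    columns_sum_to_one = np.allclose(U.sum(axis=0), 1.0)     # sum_i u_ij = 1 for every object j
    clusters_ok = np.all((U.sum(axis=1) > 0) & (U.sum(axis=1) < N))  # no empty or all-containing cluster
    if not fuzzy:
        in_range = in_range and bool(np.all(np.isin(U, (0.0, 1.0))))  # hard case: u_ij in {0, 1}
    return bool(in_range and columns_sum_to_one and clusters_ok)

labels = np.array([0, 0, 1, 2, 1])
U = hard_membership(labels, K=3)
print(is_valid_membership(U, fuzzy=False))   # True
```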
[Figure: overview of the clustering process; annotations from the diagram: “E.g. Principal Component Analysis”; “Partition-based? Hierarchical? Density-based?”; “Are the clusters meaningful?”; “A set of clusters is not itself a finished result but only a possible outline. Further experiments may be required.”]
• Engineering: biometric and speech recognition, radar signal
analysis, information compression and noise removal;
• Computer Science: web mining, spatial database analysis,
information retrieval, image segmentation;
• Life and medical sciences: taxonomy definition, gene and protein
function identification, disease diagnosis and treatment;
• Social sciences: behavior pattern analysis, analysis of social
networks, study of criminal psychology;
• Economics: customer characterization, purchasing pattern
recognition, stock trend analysis.
• A data object is described by a set of features or
variables, usually represented as a multidimensional
vector
• A feature can be classified as:
• Continuous
• Discrete
• Binary
• Another property of features is the measurement
level, which reflects the relative significance of
numbers
• A feature can be:
• Nominal (labels without a specified order)
• Ordinal (labels with a specified order)
• Interval (numerical values, without a true zero – e.g. Celsius
degrees)
• Ratio (numerical values, with a true zero – e.g. Kelvin degrees)
• Missing data are quite common for real data sets due to all
kinds of limitations and uncertainties
• How can we deal with them?
• The first thought of how to deal with missing data may be to
discard the records that contain missing features
• This approach can only work when the number of objects that
have missing features is much smaller than the total number
of objects in the data set.
• Approach #1: calculate the proximity by using only the feature values that are available:
$D(x_i, x_j) = \frac{d}{d - \sum_{l=1}^{d}\delta_{ijl}} \sum_{l=1}^{d} d_l(x_{il}, x_{jl})$
where $d_l(x_{il}, x_{jl})$ is the distance between the lth pair of components of the two objects, and $\delta_{ijl} = 1$ if $x_{il}$ or $x_{jl}$ is missing (that term is dropped from the sum), $\delta_{ijl} = 0$ otherwise
• Approach #2: calculate the average distance between all pairs of data objects along each feature and use it to estimate the distances for the missing features.
The average distance for the lth feature is obtained over the pairs for which both values are present:
$\bar{d}_l = \frac{1}{N_l}\sum_{i,j:\ \delta_{ijl}=0} d_l(x_{il}, x_{jl})$, with $N_l$ the number of such pairs.
Therefore, the distance for a missing feature is estimated as $d_l(x_{il}, x_{jl}) = \bar{d}_l$.
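A minimal sketch of Approach #1 (my own illustration, not the lecture's code; it assumes NumPy, NaN-coded missing values, and squared differences as the per-feature distance):

```python
import numpy as np

def partial_distance(x_i: np.ndarray, x_j: np.ndarray) -> float:
    """Squared Euclidean distance over the features present in both objects,
    rescaled by d / (number of usable features)."""
    d = x_i.shape[0]
    usable = ~(np.isnan(x_i) | np.isnan(x_j))        # positions where delta_ijl == 0
    if not usable.any():
        raise ValueError("no feature is present in both objects")
    diffs = (x_i[usable] - x_j[usable]) ** 2         # per-feature distances d_l
    return float(d / usable.sum() * diffs.sum())

x_i = np.array([1.0, np.nan, 3.0, 4.0])
x_j = np.array([2.0, 0.5, np.nan, 6.0])
print(partial_distance(x_i, x_j))                    # uses features 0 and 3 only
```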
• A dissimilarity or distance function on a data set X is defined to satisfy the following conditions:
• Symmetry: $D(x_i, x_j) = D(x_j, x_i)$
• Positivity: $D(x_i, x_j) \ge 0$ for all $x_i, x_j \in X$
• If the following also hold, then it is called a metric:
• Triangle Inequality: $D(x_i, x_j) \le D(x_i, x_k) + D(x_k, x_j)$ for all $x_i, x_j, x_k \in X$
• Reflexivity: $D(x_i, x_j) = 0$ if and only if $x_i = x_j$
• Likewise, a similarity function on a data set X is defined to satisfy the following conditions:
• Symmetry: $S(x_i, x_j) = S(x_j, x_i)$
• Positivity: $0 \le S(x_i, x_j) \le 1$ for all $x_i, x_j \in X$
• If the following also hold, then it is called a similarity metric:
• Triangle Inequality: $S(x_i, x_j)\,S(x_j, x_k) \le [S(x_i, x_j) + S(x_j, x_k)]\,S(x_i, x_k)$ for all $x_i, x_j, x_k \in X$, because $D(i,j) = 1/S(i,j)$
• Reflexivity: $S(x_i, x_j) = 1$ if and only if $x_i = x_j$
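To see where the similarity form of the triangle inequality comes from, here is a short derivation under the stated assumption $D(i,j) = 1/S(i,j)$ with $S > 0$:

$D(i,k) \le D(i,j) + D(j,k) \;\Longleftrightarrow\; \frac{1}{S(i,k)} \le \frac{1}{S(i,j)} + \frac{1}{S(j,k)} \;\Longleftrightarrow\; S(i,j)\,S(j,k) \le [S(i,j) + S(j,k)]\,S(i,k)$

where the last step multiplies both sides by $S(i,j)\,S(j,k)\,S(i,k) > 0$.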
• Perhaps the most commonly known is the Euclidean distance, or L2 norm, represented as:
$D(x_i, x_j) = \big( \sum_{l=1}^{d} (x_{il} - x_{jl})^2 \big)^{1/2}$
• This measure tends to form hyperspherical clusters
• If the features are measured with very different units, features with large values and variances will dominate the others
• How can we deal with features with different scales?
• We standardize the data, so that each feature has zero mean and unit variance (z-normalization):
$x_{il}^{*} = \frac{x_{il} - \mu_l}{\sigma_l}$, where $\mu_l$ and $\sigma_l$ are the mean and the standard deviation of the lth feature
• Another approach is based on the maximum and minimum of the data, so that all features lie in the range [0, 1]:
$x_{il}^{*} = \frac{x_{il} - \min_i x_{il}}{\max_i x_{il} - \min_i x_{il}}$
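A minimal NumPy sketch (illustrative, not from the lecture; `X` is an assumed N x d data matrix) of the two rescalings and the Euclidean distance on the rescaled data:

```python
import numpy as np

def z_normalize(X: np.ndarray) -> np.ndarray:
    return (X - X.mean(axis=0)) / X.std(axis=0)        # zero mean, unit variance per feature

def min_max_scale(X: np.ndarray) -> np.ndarray:
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)                   # every feature mapped into [0, 1]

def euclidean(x_i: np.ndarray, x_j: np.ndarray) -> float:
    return float(np.sqrt(np.sum((x_i - x_j) ** 2)))

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 800.0]])
Z = z_normalize(X)
print(euclidean(Z[0], Z[1]))   # distance no longer dominated by the large-scale feature
```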
• The Euclidean distance can be generalized as a special case of a family of metrics, called the Minkowski distance, or Lp norm:
$D(x_i, x_j) = \big( \sum_{l=1}^{d} |x_{il} - x_{jl}|^p \big)^{1/p}$
• When p = 2, this distance becomes the Euclidean distance
• When p = 1, this distance is called the Manhattan distance, or L1 norm
• When p = ∞, this is called the sup distance, or L∞ norm: $D(x_i, x_j) = \max_{1 \le l \le d} |x_{il} - x_{jl}|$
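A minimal sketch (illustrative) of the Minkowski family, with p = 1, 2 and ∞ recovering the Manhattan, Euclidean and sup distances:

```python
import numpy as np

def minkowski(x_i: np.ndarray, x_j: np.ndarray, p: float = 2.0) -> float:
    diff = np.abs(x_i - x_j)
    if np.isinf(p):
        return float(diff.max())                  # L-infinity (sup) distance
    return float((diff ** p).sum() ** (1.0 / p))

x_i, x_j = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x_i, x_j, p=1))        # 7.0  (Manhattan)
print(minkowski(x_i, x_j, p=2))        # 5.0  (Euclidean)
print(minkowski(x_i, x_j, p=np.inf))   # 4.0  (sup)
```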
• The Mahalanobis distance is defined as:
$D(x_i, x_j) = \sqrt{(x_i - x_j)^T S^{-1} (x_i - x_j)}$
where S is the within-class covariance matrix, defined as $S = E[(x - \mu)(x - \mu)^T]$, where μ is the mean vector and E the expected value.
• This distance is effective when features are correlated
• When the features are uncorrelated and have unit variance, this distance is equivalent to the Euclidean distance
• It is computationally expensive for large-scale data sets
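A minimal sketch (illustrative, not the lecture's code) of the Mahalanobis distance, with the covariance matrix estimated from an assumed data matrix `X`:

```python
import numpy as np

def mahalanobis(x_i: np.ndarray, x_j: np.ndarray, X: np.ndarray) -> float:
    S = np.cov(X, rowvar=False)               # d x d covariance estimated from the data
    diff = x_i - x_j
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))  # solve S y = diff instead of inverting S

X = np.random.default_rng(0).normal(size=(100, 3))
print(mahalanobis(X[0], X[1], X))
```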
• The distance measure can also be derived from a correlation coefficient, such as the Pearson correlation coefficient:
$r_{ij} = \frac{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)^2}\,\sqrt{\sum_{l=1}^{d} (x_{jl} - \bar{x}_j)^2}}$
where $\bar{x}_i = \frac{1}{d}\sum_{l=1}^{d} x_{il}$ and $\bar{x}_j = \frac{1}{d}\sum_{l=1}^{d} x_{jl}$
• The correlation coefficient is in the range [-1, 1], so the distance measure is defined as:
$D(x_i, x_j) = \frac{1 - r_{ij}}{2}$
• This measure discloses differences in shape rather than the magnitude of the differences between the two objects.
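A minimal sketch (illustrative) of the Pearson-based distance, computed across the d features of two objects:

```python
import numpy as np

def pearson_distance(x_i: np.ndarray, x_j: np.ndarray) -> float:
    r = np.corrcoef(x_i, x_j)[0, 1]      # correlation of the two feature profiles
    return float((1.0 - r) / 2.0)        # maps [-1, 1] into [0, 1]

x_i = np.array([1.0, 2.0, 3.0, 4.0])
x_j = np.array([2.0, 4.0, 6.0, 8.0])     # same "shape", different magnitude
print(pearson_distance(x_i, x_j))        # 0.0: the shapes are identical
```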
• One of the most used similarity measures for continuous variables is the cosine similarity:
$S(x_i, x_j) = \cos\alpha = \frac{x_i^T x_j}{\lVert x_i \rVert\,\lVert x_j \rVert}$
• If two objects are similar, they will be more nearly parallel in the feature space, and the cosine value will be greater.
• Like the Pearson correlation coefficient, this measure is unable to provide information on the magnitude of the differences
• Cosine similarity can be turned into a distance measure by simply using D(xi,xj) = 1 – S(xi,xj)
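A minimal sketch (illustrative) of cosine similarity and the derived distance D = 1 − S:

```python
import numpy as np

def cosine_similarity(x_i: np.ndarray, x_j: np.ndarray) -> float:
    return float(x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j)))

x_i = np.array([1.0, 1.0, 0.0])
x_j = np.array([2.0, 2.0, 0.0])          # parallel to x_i, larger magnitude
print(cosine_similarity(x_i, x_j))       # 1.0
print(1.0 - cosine_similarity(x_i, x_j)) # distance 0.0
```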
• Similarity measures are most commonly used for discrete
features, and many of them take only two values
• Binary features can be classified as:
• Symmetric : both values are equally significant
• Asymmetric : one value (often represented as 1) is more important than
the other
• Examples:
• The feature of gender is symmetric (“female” and “male” can be encoded as 1 or 0 without affecting the evaluation of the similarity)
• The presence/absence of a rare form of a gene can be asymmetric (the presence may be more important than the absence)
• Similarity measures for symmetric binary variables take the form:
$S(x_i, x_j) = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + w(n_{10} + n_{01})}$
where $n_{11}$ and $n_{00}$ are the numbers of features on which both objects take the value 1 or the value 0, respectively, and $n_{10}$, $n_{01}$ are the numbers of mismatched features
• Different values of w have been proposed:
• w = 1, simple matching coefficient
• w = 2, Rogers and Tanimoto
• w = 1/2, Sokal and Sneath
• These invariant similarity measures regard the 1-1 match and the 0-0 match as equally important
• The corresponding dissimilarity measure, based on the number of mismatched features, is known as the Hamming distance
• Similarity measures for asymmetric binary variables take the form:
$S(x_i, x_j) = \frac{n_{11}}{n_{11} + w(n_{10} + n_{01})}$
• Different values of w have been proposed:
• w = 1, Jaccard coefficient
• w = 2, Sokal and Sneath
• w = 1/2, Dice
• These non-invariant similarity measures focus on the 1-1 match while ignoring the effect of the 0-0 match, which is considered uninformative
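A minimal sketch (illustrative) of both families of coefficients, parameterized by w and by whether the 0-0 matches are counted:

```python
import numpy as np

def binary_similarity(x_i: np.ndarray, x_j: np.ndarray,
                      w: float = 1.0, symmetric: bool = True) -> float:
    """Weighted matching coefficient: symmetric=True counts 0-0 matches (invariant
    family), symmetric=False ignores them (non-invariant family)."""
    n11 = np.sum((x_i == 1) & (x_j == 1))
    n00 = np.sum((x_i == 0) & (x_j == 0))
    mismatch = np.sum(x_i != x_j)                      # n10 + n01
    agree = n11 + n00 if symmetric else n11
    return float(agree / (agree + w * mismatch))

x_i = np.array([1, 1, 0, 0, 1])
x_j = np.array([1, 0, 0, 0, 1])
print(binary_similarity(x_i, x_j, w=1.0, symmetric=True))   # simple matching: 0.8
print(binary_similarity(x_i, x_j, w=1.0, symmetric=False))  # Jaccard: 0.666...
```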
• For discrete features that have more than two states, a simple
and direct strategy is to map them into a larger number of
new binary features
• These new binary features are asymmetric
• This may introduce too many binary variables
• Are there other strategies?
• Simple Matching Criterion:
$S(x_i, x_j) = \frac{1}{d}\sum_{l=1}^{d} S_{ijl}$, where $S_{ijl} = 1$ if $x_{il} = x_{jl}$ and $S_{ijl} = 0$ otherwise
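A minimal sketch (illustrative) of the simple matching criterion for nominal features with more than two states:

```python
import numpy as np

def simple_matching(x_i: np.ndarray, x_j: np.ndarray) -> float:
    return float(np.mean(x_i == x_j))            # average of S_ijl over the d features

x_i = np.array(["red", "round", "small"])
x_j = np.array(["red", "square", "small"])
print(simple_matching(x_i, x_j))                 # 2/3
```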
• Sometimes, categorical features display certain orders
• The codes from 1 to Ml (where Ml is the number of levels, or the highest level, for feature l) are no longer meaningless
• The closer two levels are, the more similar the two objects will be with regard to that feature
• Since the number of possible levels varies across features, the levels are converted into the range [0, 1]:
$z_{il} = \frac{r_{il} - 1}{M_l - 1}$, where $r_{il} \in \{1, \dots, M_l\}$ is the level of the lth feature of the ith object
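A minimal sketch (illustrative) of this rescaling of ordinal levels:

```python
import numpy as np

def scale_ordinal(levels: np.ndarray, M_l: int) -> np.ndarray:
    """Map ordinal levels 1..M_l into [0, 1] as z = (level - 1) / (M_l - 1)."""
    return (levels - 1) / (M_l - 1)

ratings = np.array([1, 3, 5])          # e.g. a 5-level ordinal feature
print(scale_ordinal(ratings, M_l=5))   # [0.  0.5 1. ]
```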
• For real data sets it is common to see both continuous and categorical features at the same time
• Generally, we can map all features into the interval [0, 1] and apply measures such as the Euclidean distance
• But this is unfeasible for categorical variables whose classes are just names without any meaning
• We can transform all features into binary variables and use only binary similarity functions, but this leads to information loss
• A more powerful method was proposed by Gower:
$S(x_i, x_j) = \frac{\sum_{l=1}^{d} \delta_{ijl} S_{ijl}}{\sum_{l=1}^{d} \delta_{ijl}}$
where $S_{ijl}$ indicates the similarity for the lth feature, and $\delta_{ijl}$ is a 0-1 coefficient that tracks whether the feature is missing ($\delta_{ijl} = 0$ if $x_{il}$ or $x_{jl}$ is missing, 1 otherwise)
• For discrete variables: $S_{ijl} = 1$ if $x_{il} = x_{jl}$, and $S_{ijl} = 0$ otherwise
• For continuous variables: $S_{ijl} = 1 - \frac{|x_{il} - x_{jl}|}{R_l}$, where $R_l$ is the range of the lth variable (max – min)
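A minimal sketch (my own illustration of the Gower formula above): mixed-type similarity, treating NaN as missing for continuous features and None as missing for categorical ones; `is_continuous` and `ranges` (holding R_l for each continuous feature) are assumed inputs.

```python
import numpy as np

def gower_similarity(x_i: list, x_j: list, is_continuous: list, ranges: list) -> float:
    num, den = 0.0, 0.0
    for a, b, cont, R in zip(x_i, x_j, is_continuous, ranges):
        if cont:
            if np.isnan(a) or np.isnan(b):             # delta_ijl = 0: skip missing values
                continue
            s = 1.0 - abs(a - b) / R                   # continuous: 1 - |difference| / range
        else:
            if a is None or b is None:
                continue
            s = 1.0 if a == b else 0.0                 # discrete: exact match
        num += s                                        # delta_ijl = 1 for this feature
        den += 1.0
    return num / den if den > 0 else 0.0

x_i = [1.70, 65.0, "blue"]
x_j = [1.80, np.nan, "blue"]
print(gower_similarity(x_i, x_j, [True, True, False], [0.5, 40.0, None]))   # 0.9
```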