Transcript
Artificial Intelligence
Distance Measure
Dae-Won Kim
School of Computer Science & Engineering
Chung-Ang University
Data mining is the process of discovering unknown relationships between patterns. Thus, many data mining techniques are based on the similarity between patterns.
We need to define what we mean
by similar, so that we can calculate
formal similarity measures.
Similarity is obtained from vectors
of measurements (features)
describing each pattern.
Instead of talking about how similar
two objects are, we could talk about
how dissimilar they are.
Given one, we can easily define the other by applying a monotonically decreasing transformation.
e.g., given a similarity S(x,y):
Dissimilarity D(x,y) = 1 - S(x,y)
Dissimilarity D(x,y) = 1 / S(x,y)
The term distance is often used
informally to refer to a dissimilarity
measure.
The Euclidean distance:
$D(x,y) = \sqrt{\sum_{i=1}^{f} \left( x(i) - y(i) \right)^2}$
It is simple and effective, and it is a metric.
Q: What happens if a measure is not a metric?
A metric is a dissimilarity measure that satisfies three conditions:
1. $D(x,y) \geq 0$ and $D(x,x) = 0$
2. $D(x,y) = D(y,x)$ (symmetry)
3. $D(x,y) \leq D(x,k) + D(k,y)$ (triangle inequality)
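To make this concrete, here is a minimal Python/NumPy sketch (the function name and sample points are ours) that computes the Euclidean distance and spot-checks the three metric conditions:

import numpy as np

def euclidean(x, y):
    # Euclidean distance between two feature vectors.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Spot-check the three metric conditions on sample points.
x, y, k = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([1.0, 1.0])
assert euclidean(x, x) == 0.0                                # D(x,x) = 0
assert euclidean(x, y) == euclidean(y, x)                    # symmetry
assert euclidean(x, y) <= euclidean(x, k) + euclidean(k, y)  # triangle inequality
print(euclidean(x, y))  # 5.0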
However, the Euclidean distance has
some limitations.
Limit 1. What if the features were
measured using different units?
(e.g., length, weight)
We can make all features equally important by using normalization or standardization.
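For instance, z-score standardization rescales every feature to zero mean and unit variance; a minimal NumPy sketch (function name ours):

import numpy as np

def standardize(X):
    # Z-score standardization: each column (feature) gets mean 0 and
    # unit variance, so different measurement units become comparable.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Length in cm and weight in kg live on very different scales.
X = np.array([[150.0, 50.0], [160.0, 60.0], [170.0, 70.0]])
print(standardize(X))  # each column now has mean 0 and unit variance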
Or, if we have some idea of the relative importance of each feature, we can weight them accordingly.
The weighted Euclidean distance:
$D(x,y) = \sqrt{\sum_{i=1}^{f} w(i) \left( x(i) - y(i) \right)^2}$
Limit 2. What if the shape of each class is not a hypersphere?
Two pairs of patterns in the same class can yield different distance values.
The Mahalanobis distance:
$D(x,y) = \sqrt{(x - y)^T \, \Sigma^{-1} \, (x - y)}$
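A minimal NumPy sketch, assuming the covariance matrix Σ is estimated from the data (with the identity matrix it reduces to the plain Euclidean distance):

import numpy as np

def mahalanobis(x, y, cov):
    # Euclidean distance after correcting for the covariance
    # (shape and orientation) of the data cloud.
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

X = np.array([[2.0, 2.1], [1.0, 1.2], [3.0, 2.9], [2.5, 2.4]])
cov = np.cov(X, rowvar=False)       # covariance estimated from the data
print(mahalanobis(X[0], X[1], cov))
print(mahalanobis(X[0], X[1], np.eye(2)))  # identity: plain Euclidean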
The Euclidean distance is generalized to the Minkowski distance ($L_\lambda$):
$D(x,y) = \left( \sum_{i=1}^{f} \left| x(i) - y(i) \right|^{\lambda} \right)^{1/\lambda}$
The Euclidean distance is the special case of $\lambda = 2$:
$D(x,y) = \left( \sum_{i=1}^{f} \left| x(i) - y(i) \right|^{2} \right)^{1/2}$
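A minimal sketch; λ = 1 gives the Manhattan distance, λ = 2 the Euclidean:

import numpy as np

def minkowski(x, y, lam):
    # Minkowski (L_lambda) distance between two feature vectors.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** lam) ** (1.0 / lam)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))  # 7.0 (Manhattan)
print(minkowski(x, y, 2))  # 5.0 (Euclidean)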
Sample (Pearson) correlation coefficient:
$D(x,y) = 1 - \frac{\sum_{i=1}^{f} \left( x(i) - \bar{x} \right)\left( y(i) - \bar{y} \right)}{\left( \sum_{i=1}^{f} \left( x(i) - \bar{x} \right)^2 \right)^{1/2} \left( \sum_{i=1}^{f} \left( y(i) - \bar{y} \right)^2 \right)^{1/2}}$
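A minimal sketch (function name ours); note that patterns with the same shape but different scale come out at distance 0:

import numpy as np

def correlation_distance(x, y):
    # 1 minus the sample (Pearson) correlation coefficient.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    return 1.0 - r

print(correlation_distance([1, 2, 3], [2, 4, 6]))  # 0.0: same shape, scaled
print(correlation_distance([1, 2, 3], [3, 2, 1]))  # 2.0: anti-correlated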
The Cosine distance:
$D(x,y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{f} x(i)\, y(i)}{\left( \sum_{i=1}^{f} x(i)^2 \right)^{1/2} \left( \sum_{i=1}^{f} y(i)^2 \right)^{1/2}}$
Q1: Euclidean vs. Correlation
vs. Cosine distance
Q2: Which one of the three is used
in Google?
We are now discussing how to
compute the similarity/distance
between two categorical patterns.
For binary categorical data, we can count the number of features on which the two patterns take the same or different values.
Rather than measuring the
dissimilarities between patterns, we
often measure the similarities.
The simple matching coefficient:
$S(x,y) = \frac{n_{x:1,y:1} + n_{x:0,y:0}}{n_{x:1,y:1} + n_{x:1,y:0} + n_{x:0,y:1} + n_{x:0,y:0}}$
It may be inappropriate to include the (x:0, y:0) or (x:1, y:1) counts, depending on the meaning of 0 and 1.
The Jaccard coefficient:
$S(x,y) = \frac{n_{x:1,y:1}}{n_{x:1,y:1} + n_{x:1,y:0} + n_{x:0,y:1}}$
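A minimal sketch of both coefficients over binary feature vectors (names ours):

import numpy as np

def match_counts(x, y):
    # n11, n10, n01, n00: counts of feature positions by value pair.
    x, y = np.asarray(x), np.asarray(y)
    n11 = int(np.sum((x == 1) & (y == 1)))
    n10 = int(np.sum((x == 1) & (y == 0)))
    n01 = int(np.sum((x == 0) & (y == 1)))
    n00 = int(np.sum((x == 0) & (y == 0)))
    return n11, n10, n01, n00

def simple_matching(x, y):
    n11, n10, n01, n00 = match_counts(x, y)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard(x, y):
    n11, n10, n01, _ = match_counts(x, y)
    return n11 / (n11 + n10 + n01)  # 0-0 matches are ignored

x, y = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print(simple_matching(x, y))  # 0.8: 4 of 5 features agree
print(jaccard(x, y))          # 0.666...: 2 shared 1s out of 3 non-0-0 features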
How about categorical data in which
the features have more than two
categories?
The Hamming distance:
$D(x,y) = \sum_{i=1}^{f} h(x(i), y(i)), \qquad h(x(i), y(i)) = \begin{cases} 0 & \text{if } x(i) = y(i) \\ 1 & \text{otherwise} \end{cases}$
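A minimal sketch; it works for any categorical values, not just binary ones:

def hamming(x, y):
    # Number of features on which x and y disagree.
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming(['red', 'small', 'round'], ['red', 'large', 'round']))  # 1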
For example, what is your approach to mixed-type data (numeric and categorical features)? You need to develop a good distance measure for mixed data.
Approach 1. Preprocessing
Preprocess the mixed data into single-type data:
1. discretize numeric features, or
2. encode categorical features as numeric integer values.
However, this approach often leads to a loss of important information.
Approach 2. Mixed-data distance
Let us mix two distance measures:
D(x,y) = Euclidean(x,y) + Hamming(x,y)
What would the issues be? The two terms live on different scales, so one can dominate the sum. Hence:
D(x,y) = Normalized Euclidean(x,y) + Normalized Hamming(x,y)
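A minimal sketch of the normalized combination; the range-based scaling and the equal weighting of the two terms are our assumptions, not a fixed recipe:

import numpy as np

def mixed_distance(x_num, y_num, x_cat, y_cat, num_ranges):
    # Numeric part: Euclidean on range-normalized features, so each
    # term falls roughly into [0, 1] and neither part dominates.
    diff = (np.asarray(x_num, float) - np.asarray(y_num, float)) \
           / np.asarray(num_ranges, float)
    d_num = np.sqrt(np.mean(diff ** 2))
    # Categorical part: Hamming distance divided by feature count.
    d_cat = sum(a != b for a, b in zip(x_cat, y_cat)) / len(x_cat)
    return d_num + d_cat

# Hypothetical patterns: (height_cm, weight_kg) + (color, shape)
print(mixed_distance([170.0, 65.0], [160.0, 70.0],
                     ['red', 'round'], ['green', 'round'],
                     num_ranges=[50.0, 40.0]))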
Tip: HVDM (Heterogeneous Value Difference Metric) is a good starting point.
Two values are considered closer if they have more similar classifications (labels).
Suppose a feature color takes three values, red, green, and blue, when identifying whether an object is an apple. Red and green would be considered closer than red and blue, because the former two have similar correlations with the output class apple.
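A simplified sketch of the categorical part of that idea (a value-difference measure; the helper names and toy data are ours):

def value_difference(values, labels, a, b):
    # Two values are close if they co-occur with similar class labels:
    # sum over classes of |P(class|value=a) - P(class|value=b)|.
    classes = set(labels)
    def p(c, v):
        matched = [l for v2, l in zip(values, labels) if v2 == v]
        return matched.count(c) / len(matched) if matched else 0.0
    return sum(abs(p(c, a) - p(c, b)) for c in classes)

# Toy data: is the object an apple, given its color?
color = ['red', 'red', 'green', 'green', 'blue', 'blue']
apple = ['yes', 'yes', 'yes', 'no', 'no', 'no']
print(value_difference(color, apple, 'red', 'green'))  # 1.0: similar profiles
print(value_difference(color, apple, 'red', 'blue'))   # 2.0: opposite profiles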
Q: Distance measure in Project II-2
The first trial: 1-NN using the Euclidean distance for fixed-length data.
Two instances of the same character can show different signal shapes in time and magnitude.
Is the Euclidean distance working?
Two signals should be aligned and
stretched in each axis.
Of the algorithms you are familiar
with, which one is best to calculate
the distance between two patterns?
Dynamic Programming
Dynamic Time Warping
DTW is the most widely used algorithm for measuring the similarity between two signals that vary in time or speed, warping the signals non-linearly in the time dimension.
As expected, it is easy to implement
(dynamic programming).
DTW for 1-D time-series is simple.
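A minimal dynamic-programming sketch for 1-D signals (names ours). For the 3-D case, one common extension is to replace the point-wise |s_i − t_j| cost with the Euclidean distance between 3-D frames:

import numpy as np

def dtw(s, t):
    # D[i, j] = cost of the best warping path aligning s[:i] with t[:j].
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # extend the cheapest of the three allowed warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 2, 1, 0]  # same shape, stretched in time
print(dtw(a, b))  # 0.0: DTW warps the stretch away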
Tip.
Project-II contains 3-D time-series. We are expecting good work from you.