Multivariate Distance and Similarity
Robert F. Murphy
Cytometry Development Workshop 2000
General Multivariate Dataset
• We are given values of p variables for n independent observations
• Construct an n x p matrix M consisting of vectors X1 through Xn, each of length p
Multivariate Sample Mean
• Define the mean vector I of length p

$$I(j) = \frac{1}{n} \sum_{i=1}^{n} M(i,j)$$ (matrix notation)

or

$$I = \frac{1}{n} \sum_{i=1}^{n} X_i$$ (vector notation)
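As a minimal sketch of the matrix-notation formula, the mean vector can be computed column by column. The function name `mean_vector` and the small 4 x 2 dataset are hypothetical, chosen only for illustration; M is stored as a list of rows, one row per observation.

```python
def mean_vector(M):
    """Return the mean vector I: one mean per variable (column) of M."""
    n = len(M)        # number of observations (rows)
    p = len(M[0])     # number of variables (columns)
    return [sum(M[i][j] for i in range(n)) / n for j in range(p)]

# Hypothetical dataset: n = 4 observations of p = 2 variables.
M = [[2.0, 1.0],
     [4.0, 3.0],
     [6.0, 5.0],
     [8.0, 11.0]]

I = mean_vector(M)
print(I)  # [5.0, 5.0]
```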
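Multivariate Variance
• Define the variance vector s^2 of length p

$$s^2(j) = \frac{\sum_{i=1}^{n} \left( M(i,j) - I(j) \right)^2}{n-1}$$ (matrix notation)

or

$$s^2 = \frac{\sum_{i=1}^{n} \left( X_i - I \right)^2}{n-1}$$ (vector notation)

In the vector form the square is taken element by element, so s^2 is again a vector of length p.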
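The variance formula can be sketched the same way, one variable at a time, with the n - 1 denominator from the slide. The name `variance_vector` and the example data are assumptions made for illustration.

```python
def variance_vector(M):
    """Return s^2: the sample variance (n - 1 denominator) of each column of M."""
    n = len(M)
    p = len(M[0])
    # Mean vector, called I on the slides.
    mean = [sum(row[j] for row in M) / n for j in range(p)]
    return [sum((row[j] - mean[j]) ** 2 for row in M) / (n - 1)
            for j in range(p)]

# Hypothetical dataset: n = 4 observations of p = 2 variables.
M = [[2.0, 1.0],
     [4.0, 3.0],
     [6.0, 5.0],
     [8.0, 11.0]]

s2 = variance_vector(M)
print(s2)  # squared deviations sum to 20 and 56, each divided by n - 1 = 3
```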
Covariance Matrix
• Define a p x p matrix cov (called the covariance matrix) analogous to s^2

$$\mathrm{cov}(j,k) = \frac{\sum_{i=1}^{n} \left( M(i,j) - I(j) \right) \left( M(i,k) - I(k) \right)}{n-1}$$

• Note that the covariance of a variable with itself is simply the variance of that variable

$$\mathrm{cov}(j,j) = s^2(j)$$
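A short sketch of the covariance matrix follows, under the same assumptions as before (hypothetical function name and data, M stored as a list of rows). Its diagonal entries reproduce the per-variable variances, as the slide notes.

```python
def covariance_matrix(M):
    """Return the p x p sample covariance matrix of the columns of M."""
    n = len(M)
    p = len(M[0])
    # Mean vector, called I on the slides.
    mean = [sum(row[j] for row in M) / n for j in range(p)]
    return [[sum((row[j] - mean[j]) * (row[k] - mean[k]) for row in M) / (n - 1)
             for k in range(p)]
            for j in range(p)]

# Hypothetical dataset: n = 4 observations of p = 2 variables.
M = [[2.0, 1.0],
     [4.0, 3.0],
     [6.0, 5.0],
     [8.0, 11.0]]

cov = covariance_matrix(M)
for row in cov:
    print(row)
# cov[j][j] is the variance of variable j; cov[j][k] equals cov[k][j].
```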
Univariate Distance
• The simple distance between the values of a single variable j for two observations i and l is

$$M(i,j) - M(l,j)$$
Univariate z-score Distance
• To measure distance in units of standard deviation between the values of a single variable j for two observations i and l we define the z-score distance

$$\frac{M(i,j) - M(l,j)}{s(j)}$$
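The two univariate distances can be sketched as follows. The function names and data are hypothetical, and indices are zero-based in the code. The sign of the difference is kept as in the formulas; wrapping the result in `abs()` would give a non-negative distance.

```python
import math

def std_dev(M, j):
    """Sample standard deviation s(j) of variable j (n - 1 denominator)."""
    n = len(M)
    mean_j = sum(row[j] for row in M) / n
    return math.sqrt(sum((row[j] - mean_j) ** 2 for row in M) / (n - 1))

def simple_distance(M, i, l, j):
    """Difference between observations i and l on variable j."""
    return M[i][j] - M[l][j]

def z_score_distance(M, i, l, j):
    """The same difference, expressed in units of standard deviation."""
    return simple_distance(M, i, l, j) / std_dev(M, j)

# Hypothetical dataset: n = 4 observations of p = 2 variables.
M = [[2.0, 1.0],
     [4.0, 3.0],
     [6.0, 5.0],
     [8.0, 11.0]]

print(simple_distance(M, 3, 0, 0))   # 6.0
print(z_score_distance(M, 3, 0, 0))  # 6.0 divided by s(0)
```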
Bivariate Euclidean Distance
• The most commonly used measure of distance between two observations i and l on two variables j and k is the Euclidean distance

$$\sqrt{\left( M(i,j) - M(l,j) \right)^2 + \left( M(i,k) - M(l,k) \right)^2}$$
Multivariate Euclidean Distance
• This can be extended to more than two variables

$$\sqrt{\sum_{j=1}^{p} \left( M(i,j) - M(l,j) \right)^2}$$
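A minimal sketch of the Euclidean distance between two observation vectors of any length p is given below. The function name `euclidean_distance` and the example vectors are hypothetical.

```python
import math

def euclidean_distance(x_i, x_l):
    """Euclidean distance between two observation vectors of equal length p."""
    if len(x_i) != len(x_l):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_l)))

# Two hypothetical observations on p = 3 variables.
x_i = [1.0, 2.0, 3.0]
x_l = [4.0, 6.0, 3.0]

print(euclidean_distance(x_i, x_l))  # 5.0 (differences are 3, 4 and 0)
```

With p = 2 the same function reduces to the bivariate case.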
Effects of variance and covariance on Euclidean distance
[Figure: an ellipse showing the 50% contour of a hypothetical population, with two points labeled A and B.]
Points A and B have similar Euclidean distances from the mean, but point B is clearly “more different” from the population than point A.
Mahalanobis Distance
• To account for differences in variance between the variables, and to account for correlations between variables, we use the Mahalanobis distance

$$D^2 = (X_i - X_l) \, \mathrm{cov}^{-1} \, (X_i - X_l)^T$$
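The squared Mahalanobis distance can be sketched without forming the inverse explicitly: solve cov · y = d for y, where d = X_i - X_l, and then take the dot product of d and y. The function names, the Gaussian-elimination helper, and the covariance matrix below are assumptions made for illustration; they are not taken from the slides.

```python
def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    aug = [list(A[r]) + [b[r]] for r in range(p)]  # augmented matrix [A | b]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(aug[r][c]))
        if abs(aug[pivot][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, p):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, p + 1):
                aug[r][k] -= factor * aug[c][k]
    # Back substitution on the upper-triangular system.
    y = [0.0] * p
    for r in range(p - 1, -1, -1):
        tail = sum(aug[r][k] * y[k] for k in range(r + 1, p))
        y[r] = (aug[r][p] - tail) / aug[r][r]
    return y

def mahalanobis_squared(x_i, x_l, cov):
    """D^2 = (x_i - x_l) cov^{-1} (x_i - x_l)^T."""
    d = [a - b for a, b in zip(x_i, x_l)]
    y = solve(cov, d)  # y = cov^{-1} d^T
    return sum(a * b for a, b in zip(d, y))

# Hypothetical population: mean at the origin, positively correlated variables.
mean = [0.0, 0.0]
cov = [[4.0, 3.0],
       [3.0, 4.0]]

A = [2.0, 2.0]    # lies along the direction of correlation
B = [2.0, -2.0]   # lies against the direction of correlation

print(mahalanobis_squared(A, mean, cov))
print(mahalanobis_squared(B, mean, cov))
```

In this example A and B are at the same Euclidean distance from the mean, yet D^2 works out to 8/7 for A and 8 for B, so B is the more unusual point under this covariance.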