Artificial Intelligence
(Dis)similarity Representation
and Manifold Learning
Andrea Torsello
Multidimensional Scaling (MDS)
● It is often easier to provide distances between data points than explicit feature vectors.
● What can we do if we only have distances?
● Multidimensional scaling:
● Compute the low-dimensional representation φ ∈ ℝ^q that most faithfully preserves the pairwise distances of the data x ∈ ℝ^m (or similarities, which are inversely related to distances).
● Squared Euclidean distance between two points:
  dij² = ||xi − xj||² = xi·xi + xj·xj − 2 xi·xj
● Assume, without loss of generality, that the centroid of the configuration of n points is at the origin, i.e. Σi xi = 0.
● We then have
  xi·xj = −½ (dij² − (1/n) Σk dik² − (1/n) Σk dkj² + (1/n²) Σk Σl dkl²)
● We can thus define the n × n Gram matrix G of inner products from the matrix of squared Euclidean distances. This process is called centering.
● Define the matrix A by aij = dij². The Gram matrix G is then
  G = −½ HAH
  where H = I − (1/n) 11ᵀ is the centering matrix.
● If d is a squared Euclidean distance, the Gram matrix is positive semidefinite (it is the linear kernel on the centered data).
● G can be written in terms of its spectral decomposition
  G = VΛVᵀ
● Thus the configuration
  X = VΛ^(1/2)
  reproduces the pairwise distances exactly, modulo isometries.
● We can have an exact Euclidean embedding of the data if and only if the Gram matrix is positive semidefinite.
● In general, use only the k largest positive eigenvalues (and the corresponding eigenvectors):
  X = Vk Λk^(1/2)
● This minimizes the strain ||G − XXᵀ||².
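A minimal sketch of classical MDS following the derivation above (my own code, not the lecture's; the helper name classical_mds is mine). It assumes an n×n matrix D of pairwise distances and a target dimension k.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed points in R^k from an n x n matrix D of pairwise distances."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix H = I - (1/n) 11^T
    G = -0.5 * H @ (D**2) @ H                 # doubly centered Gram matrix
    evals, evecs = np.linalg.eigh(G)          # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]         # k largest eigenvalues
    lam = np.clip(evals[idx], 0, None)        # discard negative eigenvalues
    return evecs[:, idx] * np.sqrt(lam)       # X = V_k Lambda_k^(1/2)

# Usage: recover a 2-D configuration from its own distance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = classical_mds(D, k=2)                     # matches X up to an isometry
```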
● What can we do if the data cannot be embedded in a Euclidean space, i.e. the Gram matrix has negative eigenvalues?
1. Ignore: take only the positive eigenvalues.
2. Correct: change the distances so that the eigenvalues are all non-negative.
   ● We can add a constant to all squared distances. In fact, if G is the Gram matrix and λn is its smallest (negative) eigenvalue, then the matrix
     G̃ = G − λn H
     is positive semidefinite and is the Gram matrix of the shifted squared distance matrix
     d̃ij² = dij² − 2λn   (for i ≠ j).
3. Other...
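A sketch of this constant-shift correction (my own code under the assumptions above; shift_embed is a hypothetical helper name): shift the off-diagonal squared distances until the centered Gram matrix is positive semidefinite, then apply classical MDS.

```python
import numpy as np

def shift_embed(D, k=2):
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * H @ (D**2) @ H
    lam_min = np.linalg.eigvalsh(G).min()
    if lam_min < 0:
        # d~_ij^2 = d_ij^2 - 2*lam_min for i != j, i.e. G~ = G - lam_min * H
        D2 = D**2 - 2 * lam_min * (1 - np.eye(n))
        G = -0.5 * H @ D2 @ H
    evals, evecs = np.linalg.eigh(G)
    idx = np.argsort(evals)[::-1][:k]
    return evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0, None))
```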
Relations to learning algorithms
● Embedding transformations are related to properties of learning algorithms.
● E.g., consider the k-means clustering algorithm, which minimizes the distortion
  J = Σk Σi∈Ck ||xi − μk||²
  where μk is the centroid of cluster Ck.
● Since the centroids are linear in the points' positions, we can express the distance to the centroids in terms of the distances between the points:
  ||xi − μk||² = (1/nk) Σj∈Ck dij² − (1/(2nk²)) Σj,l∈Ck djl²
● Thus its energy function can be expressed in terms of pairwise distance data alone:
  J = Σk (1/(2nk)) Σi,j∈Ck dij²
● It is easy to see that a constant shift in the squared distances does not affect the minimizers of the distortion J: for a fixed number of clusters K it only adds the constant (c/2)(n − K).
● To cluster data expressed in terms of non-Euclidean (dis)similarities we can therefore perform a shift embedding without affecting the cluster structure.
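A small numeric check of this claim (my own sketch, not from the slides): adding a constant c to all off-diagonal squared distances changes the distortion J of every fixed partition by the same amount, so the minimizing partition is unchanged.

```python
import numpy as np
from itertools import product

def distortion(D2, labels):
    """Pairwise form of the k-means cost: J = sum_k 1/(2 n_k) sum_{i,j in C_k} d_ij^2."""
    J = 0.0
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        J += D2[np.ix_(idx, idx)].sum() / (2 * len(idx))
    return J

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
D2 = ((X[:, None] - X[None, :])**2).sum(-1)
c = 5.0
D2_shift = D2 + c * (1 - np.eye(len(X)))        # shift off-diagonal squared distances

# every 2-cluster labeling with both clusters non-empty
for labels in product([0, 1], repeat=len(X)):
    labels = np.array(labels)
    if labels.min() == labels.max():
        continue
    diff = distortion(D2_shift, labels) - distortion(D2, labels)
    assert np.isclose(diff, c / 2 * (len(X) - 2))  # same constant for every labeling
```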
● In many cases, if we transform the feature values in a non-linear way, we can turn a problem that was not linearly separable into one that is.
● Similarly, when training samples are not separable in the original space, they may become separable after a transformation into a higher-dimensional space.
● To get to the new feature space, use a function φ(x).
● The transformation can be to a higher-dimensional space and can be non-linear.
● You need a way to compute dot products in the transformed space as a function of vectors in the original space.
● If the dot product can be efficiently computed by
  φ(xi)·φ(xj) = K(xi,xj)
  all you need is a function K on the low-dimensional inputs; you don't ever need to compute the high-dimensional mapping φ(x).
● A function K : X×X→ℝ is positive-definite if for any set {x1,...,xn} ⊆ X the matrix Kij = K(xi,xj) is positive semidefinite.
● Mercer's Theorem: let K : X×X→ℝ be a positive-definite function; then there exist a (possibly infinite-dimensional) vector space Y and a function φ : X→Y such that
  K(x1,x2) = φ(x1)·φ(x2)
● This means that you can substitute dot products with any positive-definite function K (called a kernel) and you have an implicit non-linear mapping to a high-dimensional space.
● If you choose your kernel properly, your decision boundary bends to fit the data.
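A quick illustration of positive-definiteness (my own sketch): the Gram matrix of a valid kernel, here the Gaussian (RBF) kernel, is positive semidefinite on any finite set of points.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    sq = ((X[:, None] - X[None, :])**2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_kernel(X)
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True: all eigenvalues >= 0
```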
Kernels
There are various choices for kernels:
● Linear kernel
  K(xi,xj) = xi·xj
● Polynomial kernel
  K(xi,xj) = (1 + xi·xj)^n
● Radial Basis Function (RBF)
  K(xi,xj) = exp(−||xi − xj||² / (2σ²))
● Polynomial kernel with n = 2: for a two-dimensional input x = (x1, x2) it is equivalent to the mapping
  φ(x) = (1, √2 x1, √2 x2, x1², x2², √2 x1x2)
● We can verify that φ(x)·φ(y) = (1 + x·y)².
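A quick numeric check of that identity (my own sketch): the explicit degree-2 feature map reproduces the polynomial kernel (1 + x·y)² for 2-D inputs.

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])
print(np.isclose(phi(x) @ phi(y), (1 + x @ y)**2))   # True
```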
Radial Basis Kernel
● The resulting classifier is a sum of Gaussian bumps centered on the support vectors.
Example Separation Boundaries
Manifold learning
● Find a low-D basis for describing high-D data.
● Uncovers the intrinsic dimensionality (invertible).
● Why?
  ● data compression
  ● curse of dimensionality
  ● de-noising
  ● visualization
  ● reasonable distance metrics
Example
● Appearance variation
Example
● Deformation
Reasonable distance metrics
● What is the mean position in the timeline?
  Linear interpolation vs. manifold interpolation
Kernel PCA
● PCA cannot handle non-linear manifolds.
● We have seen that the “kernel trick” can “bend” linear classifiers.
● Can we use the kernel trick with PCA?
● Assume that the data already has mean 0.
● The covariance matrix is
  C = (1/n) Σi xi xiᵀ
● Not a dot product! (it is an outer product)
● Assume a non-linear map φ onto an M-dimensional space that maps the x onto barycentric coordinates, i.e. Σi φ(xi) = 0.
● In that space the sample covariance is
  C = (1/n) Σi φ(xi) φ(xi)ᵀ
● Assume C has eigenvectors vk with eigenvalues λk: C vk = λk vk.
● We have
  (1/n) Σi φ(xi) (φ(xi)ᵀ vk) = λk vk
  from which we conclude that vk can be expressed as a linear combination of the φ(xi):
  vk = Σi aki φ(xi)
● Substituting we have:
  (1/n) Σi φ(xi) φ(xi)ᵀ Σj akj φ(xj) = λk Σi aki φ(xi)
● Left-multiplying by φ(xl)ᵀ and setting k(xi,xj) = φ(xi)ᵀφ(xj), we have
  (1/n) Σi k(xl,xi) Σj akj k(xi,xj) = λk Σi aki k(xl,xi)
  or, in matrix notation,
  K² ak = n λk K ak
● Solving the eigenvalue equation
  K ak = n λk ak
  you find the coefficients aki.
● The normalization condition is obtained from the unit norm of the vk:
  1 = vkᵀ vk = Σi Σj aki akj φ(xi)ᵀφ(xj) = akᵀ K ak = n λk akᵀ ak
● The projections onto the principal components are
  yk(x) = φ(x)ᵀ vk = Σi aki k(x, xi)
● We have assumed that φ maps the x onto barycentric coordinates; we want to generalize this.
● Let φ be any map and
  φ̃(x) = φ(x) − (1/n) Σi φ(xi)
  the corresponding centralized map.
● The corresponding centralized kernel will be
  k̃(xi,xj) = k(xi,xj) − (1/n) Σl k(xi,xl) − (1/n) Σl k(xl,xj) + (1/n²) Σl Σm k(xl,xm)
  or, in matrix notation, K̃ = HKH, with H the centering matrix.
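A minimal kernel PCA sketch following the derivation above (my own code; it assumes a Gaussian kernel and the helper name kernel_pca is mine): center the kernel matrix, solve K̃a = nλa, normalize the coefficients, and project onto the principal components.

```python
import numpy as np

def kernel_pca(X, k=2, sigma=1.0):
    n = X.shape[0]
    sq = ((X[:, None] - X[None, :])**2).sum(-1)
    K = np.exp(-sq / (2 * sigma**2))            # Gaussian kernel matrix
    H = np.eye(n) - np.ones((n, n)) / n
    K_tilde = H @ K @ H                         # centralized kernel, K~ = HKH
    evals, evecs = np.linalg.eigh(K_tilde)      # K~ a = (n*lambda) a
    idx = np.argsort(evals)[::-1][:k]
    a = evecs[:, idx] / np.sqrt(np.clip(evals[idx], 1e-12, None))  # so that ||v_k|| = 1
    return K_tilde @ a                          # y_ik = sum_j a_kj k~(x_i, x_j)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = kernel_pca(X, k=2)                          # 30 x 2 latent coordinates
```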
Examples
● Gaussian kernel
Problems with kernel PCA
● Kernel PCA uses the kernel trick to find latent coordinates of a non-linear manifold.
● However:
  ● It works on the n×n kernel matrix rather than the D×D covariance matrix.
    ● This is a problem for large datasets; on the other hand, non-linear generalization requires large amounts of data.
  ● It can map from the data to the latent space, but not vice versa:
    ● a linear combination of elements on a non-linear manifold might not lie on the manifold.
Isomap
1. Build a sparse graph with the K nearest neighbors of each point.
2. Infer the other interpoint distances by finding shortest paths on the graph (Dijkstra's algorithm).
3. Build a low-D embedded space that best preserves the complete distance matrix (via MDS).
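An Isomap sketch following these three steps (my own code, not the lecture's; it assumes the k-NN graph is connected and uses SciPy's shortest-path routine for step 2).

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=5, k=2):
    n = X.shape[0]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    # 1. sparse K-nearest-neighbor graph (0 = no edge)
    W = np.zeros_like(D)
    for i in range(n):
        nn = np.argsort(D[i])[1:n_neighbors + 1]
        W[i, nn] = D[i, nn]
        W[nn, i] = D[nn, i]
    # 2. geodesic distances via shortest paths (Dijkstra)
    G_dist = shortest_path(W, method='D', directed=False)
    # 3. classical MDS on the geodesic distance matrix
    H = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * H @ (G_dist**2) @ H
    evals, evecs = np.linalg.eigh(G)
    idx = np.argsort(evals)[::-1][:k]
    return evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0, None))
```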
Laplacian Eigenmap
● Build a weighted adjacency (neighborhood) graph W over the data points.
● Search for an embedding y that penalizes large distances between strongly connected nodes:
  minimize Σij wij (yi − yj)²
  subject to yᵀDy = 1, where D is the usual degree matrix, Dii = Σj wij.
● The constraint yᵀDy = 1 removes the arbitrary scaling factor in the embedding.
● Note that the optimization problem can be cast in terms of the generalized eigenvector problem
  Ly = λDy
  where L = D − W is the Laplacian, since Σij wij (yi − yj)² = 2 yᵀLy.
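A Laplacian-eigenmap sketch (my own code, assuming heat-kernel weights on a k-NN graph): solve the generalized eigenproblem Ly = λDy and keep the eigenvectors with the smallest non-zero eigenvalues.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_neighbors=5, k=2, t=1.0):
    n = X.shape[0]
    sq = ((X[:, None] - X[None, :])**2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(sq[i])[1:n_neighbors + 1]
        W[i, nn] = np.exp(-sq[i, nn] / t)     # heat-kernel weights on the k-NN graph
    W = np.maximum(W, W.T)                    # symmetrize
    D = np.diag(W.sum(axis=1))                # degree matrix
    L = D - W                                 # graph Laplacian
    evals, evecs = eigh(L, D)                 # generalized problem L y = lambda D y
    return evecs[:, 1:k + 1]                  # skip the trivial constant eigenvector
```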
Locally Linear Embedding (LLE)
● Find a mapping that preserves local linear relationships between neighbors.
Compute weights
● For each data point xi in D dimensions, find its K nearest neighbors.
● Fit a kind of local principal-component plane to the points in the neighborhood by minimizing
  E(W) = Σi ||xi − Σj Wij xj||²
  over weights Wij satisfying Σj Wij = 1 and Wij = 0 whenever j is not a neighbor of i.
● Wij is the contribution of point j to the reconstruction of point i.
● The least-squares solution is obtained by solving
  Σk Cjk wik = 1   (then normalizing so that Σj Wij = 1)
  where Cjk = (xi − xj)·(xi − xk) is the local Gram matrix of the neighborhood.
● The weight-based representation has several desirable invariances:
  ● It is invariant to any local rotation or scaling of xi and its neighbors (due to the linear relationship).
  ● The normalization requirement on Wij adds invariance to translation.
  ● The mapping therefore preserves angle and scale within each local neighborhood.
● Having solved for the optimal weights, which capture the local structure, find new locations yi that approximate those relationships.
● This can be done by minimizing the same quadratic cost function for the new data locations:
  Φ(Y) = Σi ||yi − Σj Wij yj||²
● In order to avoid spurious solutions we add the constraints
  Σi yi = 0 and (1/n) YᵀY = I
● We can rewrite the problem as minimizing tr(Yᵀ M Y) with M = (I − W)ᵀ(I − W).
● It is solved by the d smallest eigenvectors of M.
● 1ᵀY = 0 implies that you must skip the smallest one (the constant eigenvector).
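An LLE sketch following these steps (my own code under the assumptions above): reconstruction weights from the local Gram matrices, then the d smallest non-trivial eigenvectors of M = (I − W)ᵀ(I − W).

```python
import numpy as np

def lle(X, n_neighbors=5, d=2, reg=1e-3):
    n = X.shape[0]
    dist = ((X[:, None] - X[None, :])**2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(dist[i])[1:n_neighbors + 1]
        Z = X[nn] - X[i]                            # neighbors relative to x_i
        C = Z @ Z.T                                 # local Gram matrix C_jk
        C += reg * np.trace(C) * np.eye(len(nn))    # regularize in case C is singular
        w = np.linalg.solve(C, np.ones(len(nn)))
        W[i, nn] = w / w.sum()                      # enforce sum_j W_ij = 1
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, 1:d + 1]                        # skip the constant eigenvector
```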
Isomap: pro and con
● Pro:
  ● preserves global structure
  ● few free parameters
● Con:
  ● long-range distances become more important than local structure
  ● sensitive to noise, noisy edges in the data
  ● computationally expensive (dense matrix eigen-reduction)
LLE: pro and con
● Pro:
  ● no local minima, one free parameter
  ● incremental & fast
  ● simple linear algebra operations
● Con:
  ● can distort global structure
  ● preserves local structure at the expense of long-range distances
(Figure: example embeddings produced by Isomap and LLE.)
● No matter what your approach is, the “curvier” your manifold, the denser your data must be.