Self-Organizing Maps
• Projection of p-dimensional observations onto a two- (or one-) dimensional grid space
• A constrained version of K-means clustering
– Prototypes lie on a one- or two-dimensional manifold (constrained topological map; Teuvo Kohonen, 1993)
• K prototypes arranged on a rectangular or hexagonal grid
• Each prototype is indexed by an integer pair ℓj ∈ Q1 × Q2, where Q1 = {1, …, q1} and Q2 = {1, …, q2} (K = q1 × q2)
• High-dimensional observations are projected onto the two-dimensional coordinate system
SOM Algorithm
• Prototypes mj, j = 1, …, K, are initialized
• Each observation xi is processed one at a time to find the closest prototype mj (in Euclidean distance in the p-dimensional space)
• All neighbors mk of mj move toward xi as
mk ← mk + α (xi − mk)
• Neighbors are all mk such that the distance between ℓj and ℓk is smaller than a threshold r (the neighborhood includes mj itself)
– Distance is defined on the grid Q1 × Q2, not in the p-dimensional space
• SOM performance depends on the learning rate α and the threshold r; a sketch of the update loop follows below
– Typically, α is decreased from 1 to 0 and r from R (a predefined value) to 1 over the iterations (say, 3000 iterations)
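A minimal sketch of the online SOM update described above, assuming a rectangular q1 × q2 grid, prototypes initialized by sampling observations, one randomly chosen observation processed per iteration, and linearly decaying α and r (all names and defaults are illustrative):

import numpy as np

def som(X, q1, q2, n_iter=3000, alpha0=1.0, r0=None, seed=0):
    """Online SOM. X has shape (n, p); returns K = q1*q2 prototypes and their grid coordinates."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Integer grid coordinates l_j for each of the K prototypes
    grid = np.array([(a, b) for a in range(q1) for b in range(q2)], dtype=float)
    M = X[rng.choice(n, q1 * q2, replace=True)].astype(float)   # initialize prototypes
    r0 = r0 if r0 is not None else max(q1, q2)                  # predefined starting radius R
    for t in range(n_iter):
        alpha = alpha0 * (1 - t / n_iter)            # learning rate decays from 1 toward 0
        r = 1 + (r0 - 1) * (1 - t / n_iter)          # neighborhood threshold decays from R toward 1
        xi = X[rng.integers(n)]                      # one observation at a time
        j = np.argmin(((M - xi) ** 2).sum(axis=1))   # closest prototype in the p-dimensional space
        # Neighbors: prototypes whose grid distance to l_j is below r (includes m_j itself)
        neighbors = np.linalg.norm(grid - grid[j], axis=1) <= r
        M[neighbors] += alpha * (xi - M[neighbors])
    return M, grid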
SOM properties
• If r is small enough, each neighborhood contains only one prototype ⇒ the spatial connection between prototypes is lost ⇒ the algorithm converges to a local minimum of K-means clustering
• Need to check whether the constraint is reasonable: compute and compare the reconstruction error e = Σ ‖xi − m‖², where m is the prototype closest to xi, for both methods (SOM's e is larger, but the two should be similar); see the snippet below
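A small sketch of the reconstruction-error check: the same function can be applied to the SOM prototypes and to plain K-means centers and the two values compared (the variable names M_som and M_kmeans are illustrative):

import numpy as np

def reconstruction_error(X, M):
    # e = sum_i ||x_i - m||^2, where m is the prototype closest to x_i
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
    return d2.min(axis=1).sum()

# e_som = reconstruction_error(X, M_som)       # prototypes from the SOM
# e_km  = reconstruction_error(X, M_kmeans)    # centers from unconstrained K-means
# Expect e_som >= e_km, but of similar size if the two-dimensional constraint is reasonable.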
Tamayo et al. (1999; GeneCluster)
• Self-organizing maps (SOM) applied to microarray data
– Hematopoietic cell lines (HL60, U937, Jurkat, and NB4): 4×3 SOM
– Yeast data from Eisen et al. reanalyzed with a 6×5 SOM
Principal Component Analysis
• Data xi, i = 1, …, n, are from the p-dimensional space (n ≥ p)
– Data matrix: Xn×p
• Singular value decomposition X = ULVᵀ, where
– L is a non-negative diagonal matrix whose decreasing diagonal entries are the singular values ℓi,
– Un×p has orthonormal columns (uiᵀuj = 1 if i = j, 0 if i ≠ j), and
– Vp×p is an orthogonal matrix
• The principal components are the columns of XV (= UL); see the sketch after this slide
– X has rank at most p, i.e., at most p non-zero singular values
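A minimal sketch of PCA via the SVD, checking that the principal components XV coincide with UL (the data are simulated and column-centered; names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                             # center the columns

U, l, Vt = np.linalg.svd(X, full_matrices=False)   # X = U L Vᵀ; l holds the singular values
pcs = X @ Vt.T                                     # principal components: columns of XV
assert np.allclose(pcs, U * l)                     # identical to UL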
PCA properties
• The first column of XV (or UL) is the 1st principal component, which represents the direction with the largest variance (the corresponding eigenvalue of the covariance matrix gives its magnitude)
• The second column corresponds to the second largest variance, uncorrelated with the first, and so on
• The first q columns, q < p, of XV are the linear projection of X into q dimensions with the largest variance
• Let X̂ = ULqVᵀ, where Lq keeps only the first q diagonal entries of L (the rest set to zero) ⇒ X̂ is the best possible rank-q approximation of X; see the sketch after this slide
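A short sketch of the rank-q approximation X̂ = ULqVᵀ (q = 2 and the simulated data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)
U, l, Vt = np.linalg.svd(X, full_matrices=False)

q = 2
Lq = np.zeros_like(l)
Lq[:q] = l[:q]                      # keep only the q largest singular values
X_hat = (U * Lq) @ Vt               # best rank-q approximation of X
# The squared Frobenius error equals the sum of the discarded squared singular values
assert np.isclose(((X - X_hat) ** 2).sum(), (l[q:] ** 2).sum())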
Traditional PCA
• Variance–covariance matrix S from the data Xn×p (columns of X assumed centered)
– Eigenvalue decomposition: S = CDCᵀ, with C an orthogonal matrix
– (n − 1)S = XᵀX = (ULVᵀ)ᵀ(ULVᵀ) = VLUᵀULVᵀ = VL²Vᵀ
– Thus, D = L²/(n − 1) and C = V; a numerical check follows below
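A quick numerical check of the correspondence D = L²/(n − 1) and C = V, assuming column-centered data (a sketch; eigh returns eigenvalues in ascending order, so they are reordered to match the decreasing singular values):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)
n = X.shape[0]

U, l, Vt = np.linalg.svd(X, full_matrices=False)
S = np.cov(X, rowvar=False)                   # S = XᵀX / (n − 1) for centered X
evals, C = np.linalg.eigh(S)
evals, C = evals[::-1], C[:, ::-1]            # reorder to decreasing eigenvalues

assert np.allclose(evals, l**2 / (n - 1))     # D = L² / (n − 1)
assert np.allclose(np.abs(C), np.abs(Vt.T))   # C = V, up to column signs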
Principal Curves and Surfaces
• Let f(λ) be a parameterized smooth curve in the p-dimensional space
• For a data point x, let λf(x) be the parameter value of the point on the curve closest to x
• Then f(λ) is the principal curve for the random vector X if
f(λ) = E[X | λf(X) = λ]
• Thus, f(λ) is the average of all data points that project onto it
Algorithm for finding the principal curve
• Let f(λ) have coordinates f(λ) = [f1(λ), f2(λ), …, fp(λ)], where the random vector X = [X1, X2, …, Xp]
• Then iterate the following alternating steps until convergence:
– (a) fj(λ) ← E[Xj | λf(X) = λ], j = 1, …, p
– (b) λf(x) ← argminλ ‖x − f(λ)‖²
• The solution is the principal curve for the distribution of X; a data-based sketch follows below
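A rough data-based sketch of the two alternating steps, with a simple moving-average smoother standing in for the conditional expectation in step (a); the smoother, window size, and initialization by the first principal component are illustrative assumptions:

import numpy as np

def principal_curve(X, n_iter=20, window=15):
    # Alternate (a) smoothing each coordinate against lambda and (b) re-projecting
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ Vt[0]                              # initialize lambda with the first PC scores
    f = np.empty_like(Xc)
    kernel = np.ones(window) / window
    for _ in range(n_iter):
        order = np.argsort(lam)
        # (a) f_j(lambda) <- E[X_j | lambda_f(X) = lambda]: moving average of each
        #     coordinate over observations sorted by lambda
        for j in range(p):
            f[order, j] = np.convolve(Xc[order, j], kernel, mode="same")
        # (b) lambda_f(x) <- argmin_lambda ||x - f(lambda)||^2: index of the closest
        #     fitted point, used as the new (discrete) lambda
        d2 = ((Xc[:, None, :] - f[None, order, :]) ** 2).sum(axis=2)
        lam = d2.argmin(axis=1).astype(float)
    return f, lam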
Multidimensional scaling (MDS)
• Observations x1, x2, …, xn in the p-dimensional space, with all pairwise distances (or dissimilarity measures) dij
• MDS tries to preserve the structure of the original pairwise distances as much as possible
• It seeks vectors z1, z2, …, zn in a k-dimensional space (k << p) that minimize the "stress function"
– SD(z1, …, zn) = [ Σi≠j (dij − ‖zi − zj‖)² ]^(1/2)
– Kruskal–Shepard scaling (or least-squares scaling)
– A gradient descent algorithm is used to find the solution; a sketch follows below
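A minimal sketch of Kruskal–Shepard scaling by gradient descent on the squared-stress term Σi≠j (dij − ‖zi − zj‖)²; the step size, iteration count, and random initialization are illustrative assumptions:

import numpy as np

def kruskal_shepard_mds(D, k=2, n_iter=500, step=0.01, seed=0):
    # D is an (n, n) symmetric matrix of dissimilarities d_ij; returns Z of shape (n, k)
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Z = rng.normal(scale=1e-2, size=(n, k))           # random initial configuration
    for _ in range(n_iter):
        diff = Z[:, None, :] - Z[None, :, :]          # z_i - z_j
        dist = np.sqrt((diff ** 2).sum(axis=2))       # ||z_i - z_j||
        np.fill_diagonal(dist, 1.0)                   # avoid division by zero on the diagonal
        coef = -2.0 * (D - dist) / dist               # from d/dz_i of (d_ij - ||z_i - z_j||)^2
        np.fill_diagonal(coef, 0.0)                   # i = j terms are excluded from the stress
        grad = (coef[:, :, None] * diff).sum(axis=1)
        Z -= step * grad
    return Z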
Other MDS approaches
• Sammon mapping minimizes
– Σi≠j (dij − ‖zi − zj‖)² / dij
• Classical scaling is based on a similarity measure sij
– Often the centered inner product sij = ⟨xi − x̄, xj − x̄⟩ is used, where x̄ is the mean of the observations
– Then, minimize Σi Σj (sij − ⟨zi − z̄, zj − z̄⟩)²; a sketch follows below
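Classical scaling needs no iterative search: with Euclidean distances, the centered inner products sij are recovered by double-centering −½D², and the minimizing configuration is given by the top-k eigenvectors scaled by the square roots of their eigenvalues. A sketch (names are illustrative):

import numpy as np

def classical_scaling(D, k=2):
    # D is an (n, n) matrix of Euclidean distances; returns Z of shape (n, k)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    S = -0.5 * J @ (D ** 2) @ J                # s_ij = <x_i - xbar, x_j - xbar>
    evals, evecs = np.linalg.eigh(S)           # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:k]          # keep the k largest
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))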