Principal Component Analysis

PCA Example: FOODS
• 20 food products, 16 European countries
• [Table of consumption data by country; columns include Gr_coffee, Inst_coffee, Tea, Sweetener, Biscuit, Pe_Soup, Ti_Soup, In_Potat, Fro_Fish for Germany, Italy, France, …]

PCA Example: Red Sox
• Dataset: 110 years of Red Sox performance data
• Question: do pitchers' and batters' ages matter for performance?

Redundancy
• Arbitrary observations r1 and r2
• Low to high redundancy from (a) to (c)
• (c) can be represented by a single variable
• Spread across the best-fit line – the covariance between the two variables

Transform
• Linear transformation: a point (x, y) in one Cartesian coordinate system becomes (a, b) in another coordinate system
• Assuming a linear transformation:
  a = f(x, y) = x*c11 + y*c12
  b = g(x, y) = x*c21 + y*c22

  ( a )   ( c11  c12 ) ( x )
  ( b ) = ( c21  c22 ) ( y )

• For a review of matrices, see www.cs.uml.edu/~kim/580/review_matrix.pdf

Eigenvector

  ( 2  3 ) ( 1 )   ( 11 )
  ( 2  1 ) ( 3 ) = (  5 )

  ( 2  3 ) ( 3 )   ( 12 )       ( 3 )
  ( 2  1 ) ( 2 ) = (  8 ) = 4 * ( 2 )

• An eigenvector is projected onto the same direction (Mv = λv)
• Eigenvectors of a symmetric matrix are orthogonal
• Distinct eigenvalues are associated with linearly independent eigenvectors

Transform
• http://www.ams.org/samplings/feature-column/fcarc-svd

  ( 3  0 ) ( x )   ( 3x )
  ( 0  1 ) ( y ) = (  y )

Transform
• http://www.ams.org/samplings/feature-column/fcarc-svd

  M = ( 2  1 )
      ( 1  2 )    (symmetric)

Eigenvectors
• M vi = λi vi, where λi is a scalar and the vi are orthogonal vectors
• Non-symmetric example:

  M = ( 1  1 )
      ( 0  1 )

• For orthogonal unit vectors v1 and v2, a general vector x has coefficients given by projection onto the unit vectors:
  x = (v1 · x) v1 + (v2 · x) v2
• With M vi = σi ui, the matrix–vector product takes the form:
  Mx = (v1 · x) σ1 u1 + (v2 · x) σ2 u2
• => M = U Σ V^T

SVD (Singular Value Decomposition)
• M: m×n; U: m×m; Σ: m×n; V: n×n
• Columns of U are eigenvectors of M M^T
• Columns of V are eigenvectors of M^T M

Transform = New Coordinate
• What would be a good transform matrix for PCA? → one built from the covariance of the data

Mean, Variance, Covariance
• X = (x1, x2, …,
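The eigenvector example and the SVD relations above can be checked numerically. This is a quick numpy sketch (not part of the original slides) using the same 2×2 matrix:

```python
import numpy as np

# Matrix from the eigenvector example above.
M = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# [1, 3] is NOT an eigenvector: M @ [1, 3] = [11, 5] is not a multiple of [1, 3].
print(M @ np.array([1.0, 3.0]))   # [11.  5.]

# [3, 2] IS an eigenvector with eigenvalue 4: M @ [3, 2] = [12, 8] = 4 * [3, 2].
assert np.allclose(M @ np.array([3.0, 2.0]), 4 * np.array([3.0, 2.0]))

# np.linalg.eig finds both eigenvalues (4 and -1 for this matrix).
vals, vecs = np.linalg.eig(M)
assert np.allclose(sorted(vals), [-1.0, 4.0])

# SVD: columns of U are eigenvectors of M @ M.T, columns of V are
# eigenvectors of M.T @ M, and the squared singular values are the
# eigenvalues of those symmetric products.
U, s, Vt = np.linalg.svd(M)
w = np.linalg.eigvalsh(M @ M.T)
assert np.allclose(sorted(s**2), sorted(w))
```

Note that this M is not symmetric, so its eigenvectors are not orthogonal; the SVD factors U and V are orthogonal regardless.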
xn); Y = (y1, y2, …, yn)
• E[X] = ∑i xi / n,  E[Y] = ∑i yi / n
• Variance = (st. dev.)²:
  V[X] = ∑i (xi – E[X])² / (n–1)
• Covariance:
  cov[X,Y] = ∑i (xi – E[X])(yi – E[Y]) / (n–1)

Covariance Matrix
• For three variables X, Y, Z:

  ( cov[x,x]  cov[x,y]  cov[x,z] )
  ( cov[y,x]  cov[y,y]  cov[y,z] )
  ( cov[z,x]  cov[z,y]  cov[z,z] )

• cov[X,X] = V[X]
• cov[X,Y] = cov[Y,X]

PCA Numerical Example

  X    Y
  2.5  2.4
  0.5  0.7
  2.2  2.9
  1.9  2.2
  3.1  3.0
  2.3  2.7
  2.0  1.6
  1.0  1.1
  1.5  1.6
  1.1  0.9

• Means: E[X] = 1.81, E[Y] = 1.91
• After subtracting the means:

  X(adj)  Y(adj)
   0.69    0.49
  -1.31   -1.21
   0.39    0.99
   0.09    0.29
   1.29    1.09
   0.49    0.79
   0.19   -0.31
  -0.81   -0.81
  -0.31   -0.31
  -0.71   -1.01

• Covariance matrix:

  cov = ( 0.6165  0.6154 )
        ( 0.6154  0.7166 )

• Eigenvalues:

  | 0.6165–λ   0.6154   |
  | 0.6154     0.7166–λ | = 0   →  λ = 1.2840, 0.0491

• Eigenvectors

Example: Amino Acid (AA)

Basic Clustering of AAs
• How many clusters? Use 4 AA groups
• Good for acidic and basic; P falls in the polar group; the nonpolar group is widely spread
• Similarities of AAs determine the ease of substitutions
• Some alignment tools show similar AAs in colors
• A more systematic approach is needed

Physico-Chemical Properties
Physico-chemical properties of AAs determine protein structures:
(1) Size in volume
(2) Partial volume – the expanded volume in solution when dissolved
(3) Bulkiness – the ratio of side-chain volume to its length: the average cross-sectional area of the side chain
(4) pH of the isoelectric point of the AA (pI)
(5) Hydrophobicity
(6) Polarity index
(7) Surface area
(8) Fraction of area – the fraction of the accessible surface area that is buried in the interior in a set of known crystal structures

                            Vol.  Bulk  Pol.  pI    Hydro  Surf(Å²)  Frac
  Alanine        Ala  A      67   11.5   0.0   6.0   1.8     113     0.74
  Arginine       Arg  R     148   14.3  52.0  10.8  -4.5     241     0.64
  Asparagine     Asn  N      96   12.3   3.4   5.4  -3.5     158     0.63
  Aspartic Acid  Asp  D      91   11.7  49.7   2.8  -3.5     151     0.62
  Cysteine       Cys  C      86   13.5   1.5   5.1   2.5     140     0.91
  Glutamine      Gln  Q     114   14.5   3.5   5.7  -3.5     189     0.62
  Glutamic Acid  Glu  E     109   13.6  49.9   3.2  -3.5     183     0.62
  Glycine        Gly  G      48    3.4   0.0   6.0  -0.4      85     0.72
  Histidine      His  H     118   13.7  51.6   7.6  -3.2     194     0.78
  Isoleucine     Ile  I     124   21.4   0.1   6.0   4.5     182     0.88
  Leucine        Leu  L     124   21.4   0.1   6.0   3.8     180     0.85
  Lysine         Lys  K     135   13.7  49.5   9.7  -3.9     211     0.52
  Methionine     Met  M     124   16.3   1.4   5.7   1.9     204     0.85
  Phenylalanine  Phe  F     135   10.8   0.4   5.5   2.9     218     0.88
  Proline        Pro  P      90   17.4   1.6   6.3  -1.6     143     0.64
  Serine         Ser  S      73    9.5   1.7   5.7  -0.8     122     0.66
  Threonine      Thr  T      93   15.8   1.7   5.7  -0.7     146     0.70
  Tryptophan     Trp  W     163   21.7   2.1   5.9  -0.9     259     0.85
  Tyrosine       Tyr  Y     141   18.0   1.6   5.7  -1.3     229     0.76
  Valine         Val  V     105   21.6   0.1   6.0   4.2     160     0.86
  Mean                      109   15.4  13.6   6.0  -0.5     175     0.74

Red: acidic; Orange: basic; Green: polar (hydrophilic); Yellow: non-polar (hydrophobic)

PCA of AAs
• How to incorporate the different properties in order to group similar AAs?
• Visual clustering with volume and pI alone is limited

PCA
• Given an N×P matrix (e.g., 20×7), each row represents a P-dimensional data point
• Each data point is scaled and shifted to the origin, then rotated to spread out the points as much as possible

Scaling
• For property j, compute the average and the s.d.:
  μj = ∑i xij / N,  σj² = ∑i (xij – μj)² / N
• Since each property has a different scale and mean, define normalized variables
  zij = (xij – μj) / σj
• zij measures the deviation from the mean for each property, with mean 0 and s.d. 1

PCA
• New orthogonal coordinate system: find vj = (vj1, vj2, …, vjP) such that
  ∑k vik vjk = 0 for i ≠ j (orthogonal) and ∑k vjk² = 1 (unit length)
• vj represents a new coordinate vector
• Data points in z-coordinates become yij = ∑k zik vjk
• The new y coordinate system is a rotation of the z coordinate system
• The vjk turn out to be related to the correlation coefficients

PCA
• Correlation coefficient Cij:
  Cij = ∑k (zik – mi)(zjk – mj) / (P si sj)
  (mi, si: mean and s.d.
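The two-variable numerical example above can be reproduced in a few lines: subtract the column means, form the covariance matrix with the n–1 denominator, and read off its eigenvalues. This is an illustrative numpy check, not part of the original slides:

```python
import numpy as np

# The 10 (X, Y) points from the PCA numerical example.
data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

# Subtract the column means (E[X] = 1.81, E[Y] = 1.91).
adjusted = data - data.mean(axis=0)

# Covariance matrix with the (n-1) denominator, as in the slides.
cov = adjusted.T @ adjusted / (len(data) - 1)   # ≈ [[0.6166, 0.6154],
                                                #    [0.6154, 0.7166]]

# Eigen-decomposition; eigh is appropriate since cov is symmetric.
eigvals, eigvecs = np.linalg.eigh(cov)          # ascending: ≈ 0.0491, 1.2840

# The eigenvector of the larger eigenvalue is the first principal component.
pc1 = eigvecs[:, np.argmax(eigvals)]
```

For the 20×7 amino-acid matrix the same recipe applies, except that each column is first standardized (zij = (xij – μj)/σj) because the properties have different units and scales.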
of the i-th row)
• –1 ≤ Cij ≤ 1
• Results in an N×N similarity matrix Sij

Correlations among the seven properties:

         Vol    Bulk   Polar  pI     Hyd    SA     FrA
  Vol    1.00   0.73   0.24   0.37  -0.08   0.99   0.18
  Bulk          1.00  -0.20   0.08   0.44   0.64   0.49
  Polar                1.00   0.27  -0.69   0.29  -0.53
  pI                          1.00  -0.20   0.36  -0.18
  Hyd                                1.00  -0.18   0.84
  SA                                        1.00   0.12
  FrA                                              1.00

Clustering
• A family of related sequences evolved from a common ancestor is studied with phylogenetic trees showing the order of evolution
• Criteria needed: closeness between sequences; the number of clusters
• Hierarchical clustering – connectivity-based
• k-means – centroid-based

Hierarchical Clustering
• Algorithm:
  – Initially, each point forms its own cluster
  – Join the two clusters with the highest similarity into a single larger cluster
  – Recompute similarities between all clusters
  – Repeat the two steps above until all points are connected into clusters
• Criteria of similarity? Use the scaled coordinates z
  – Vector zi runs from the origin to data point i, with length |zi|² = ∑k zik²
  – Use the cosine of the angle between two points as similarity:
    cos θij = ∑k zik zjk / (|zi| |zj|)

Given n elements and an n×n distance matrix d:

Hierarchical_Clustering(d, n)
  form n clusters, each with 1 element
  construct a graph T by assigning an isolated vertex to each cluster
  while there is more than 1 cluster
    find the two closest clusters C1 and C2
    merge C1 and C2 into a new cluster C with |C1| + |C2| elements
    compute the distance from C to all other clusters
    add a new vertex C to T
    remove the rows and columns of d for C1 and C2, and add a row and column for C
  return T

k-means Clustering
• The number of clusters, k, is known ahead of time
• Minimize the squared error between data points and the k cluster centers

k-means Clustering Problem
• Given n data points, find k center points minimizing the squared error distortion
  d(V, X) = ∑i d(vi, X)² / n
• Input: a set V of n data points and a parameter k
• Output: a set X of k center points minimizing d(V, X) over all possible choices of X
• No polynomial-time algorithm is known
• Heuristic – Lloyd's algorithm: initially partition the n points
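The hierarchical clustering loop above can be sketched directly in numpy. This illustrative version uses cosine similarity between points and averages it over cluster members as the cluster-to-cluster similarity (one of several possible linkage rules); the data are made up for the example:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two points, as on the slide."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hierarchical_clustering(z, target=1):
    """Repeatedly merge the two most similar clusters until `target` remain.

    Returns a list of clusters, each a list of point indices.
    """
    clusters = [[i] for i in range(len(z))]        # each point starts alone
    while len(clusters) > target:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = np.mean([cosine_sim(z[i], z[j])
                               for i in clusters[a] for j in clusters[b]])
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]    # merge the closest pair
        del clusters[b]                             # drop the absorbed cluster
    return clusters

# Toy data: two tight groups on opposite sides of the origin.
pts = np.array([[1.0, 0.1], [0.9, 0.2], [-1.0, -0.1], [-0.9, -0.2]])
print(hierarchical_clustering(pts, target=2))      # [[0, 1], [2, 3]]
```

A production version would update a similarity matrix incrementally instead of recomputing all pairs, exactly as the pseudocode's "remove rows and columns of d" step suggests.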
arbitrarily to k centers, then move some points between clusters
• Converges to a local minimum; may move many points in each iteration

k-means Clustering
• Consider every possible partition P of the n elements into k clusters, each with a cost(P)

Progressive_Greedy_k-means(n)
  select an arbitrary partition P into k clusters
  while forever
    bestChange ← 0
    for every cluster C
      for every element i not in C
        if moving i to C reduces Cost(P)
          if Δ(i → C) > bestChange
            bestChange ← Δ(i → C)
            i* ← i
            C* ← C
    if bestChange > 0
      change partition P by moving i* to C*
    else
      return P

• Moves one point per iteration (in contrast to Lloyd's algorithm, which may move many)

Dynamic Modeling in Chameleon
• Similarity between clusters is determined by
  – Relative interconnectivity (RI)
  – Relative closeness (RC)
• Pairs with high RI and RC are selected for merging

Hierarchical Clustering – CLUTO
• Generates a set of clusters within clusters
• The output can be arranged as a tree: each node is where two smaller clusters join
• CLUTO package with the cosine-similarity and group-average rules
• Red/green indicates values significantly higher/lower than the average; dark colors are close to the average
• Clustering of properties: the properties can be ordered, illustrating groups of properties that are correlated
1. Red on both the pI and polarity scales
2. Green on hydrophobicity and pI (can be separated into two smaller clusters)
3. Green on volume and surface area
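Lloyd's heuristic described above alternates between assigning every point to its nearest center and recomputing each center as the mean of its cluster. A minimal numpy sketch (the initialization scheme and toy data are illustrative choices, not from the slides):

```python
import numpy as np

def lloyd_kmeans(points, k, iters=100, seed=0):
    """Lloyd's heuristic for k-means.

    Converges to a local minimum of the squared-error distortion
    d(V, X) = sum_i d(v_i, X)^2 / n; not necessarily the global optimum.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct data points chosen at random.
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break                      # no center moved: local minimum reached
        centers = new_centers
    distortion = (dists.min(axis=1) ** 2).mean()
    return centers, labels, distortion

# Two obvious blobs; k = 2 should separate them.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centers, labels, distortion = lloyd_kmeans(pts, k=2)
```

Note the contrast with the progressive greedy version above: Lloyd may reassign many points in a single iteration, while the greedy algorithm moves exactly one point per iteration.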
4. C is unusual in protein structure due to its potential to form disulfide bonds between pairs of cysteine residues (and is thus difficult to interchange with other residues)
5. Hydrophobic
6. The two largest AAs

6 Clusters

  Cluster  Property         AAs
  1        Basic            K, R, H
  2        Acid and amide   E, D, Q, N
  3        Small            P, T, S, G, A
  4        Cysteine         C
  5        Hydrophobic      V, L, I, M, F
  6        Large, aromatic  W, Y

Dayhoff Clustering – 1978
• In the PAM matrix, considered the probabilities of pairs of amino acids appearing together
• Pairs of amino acids that tend to appear together are grouped into a cluster
• Six clusters: (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)
• Contrast with the clusters from hierarchical clustering: (KRH) (EDQN) (PTSGA) (C) (VLIMF) (WY)

[Standard codon table (1st base × 2nd base × 3rd base), with amino acids colored by the six clusters (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)]

Murphy, Wallqvist, Levy, 2000
• To study protein folding, used the BLOSUM50 similarity matrix
• Determined correlation coefficients between similarity-matrix elements for all pairs of AAs, e.g.,
  C_AV = (∑i M_A,i M_V,i) / sqrt[(∑i M_A,i M_A,i)(∑i M_V,i M_V,i)]
  with the summation over i taken over the 20 AAs
• Group the two AAs with the highest correlation coefficient, then either add the AA with the next-highest correlation to an existing group or start a new group
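The C_AV-style correlation between rows of a substitution matrix can be sketched as follows. The matrix below is a small made-up stand-in (NOT real BLOSUM50 values) just to show the computation and the first grouping step:

```python
import numpy as np

# Toy stand-in for a substitution matrix (NOT real BLOSUM50 values):
# rows/columns are a few amino acids, entries are similarity scores.
aas = ["V", "L", "I", "K", "R"]
M = np.array([
    [ 4,  1,  3, -2, -3],   # V
    [ 1,  4,  2, -3, -2],   # L
    [ 3,  2,  4, -3, -3],   # I
    [-2, -3, -3,  5,  3],   # K
    [-3, -2, -3,  3,  7],   # R
], dtype=float)

def corr(a, b):
    """Normalized inner product of two matrix rows, as in C_AV above."""
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

# Correlations for all pairs of rows; the highest off-diagonal pair
# is the first pair of amino acids to be grouped together.
n = len(aas)
pairs = [(corr(M[i], M[j]), aas[i], aas[j])
         for i in range(n) for j in range(i + 1, n)]
best = max(pairs)
print(best[1], best[2])   # V I  (the most correlated rows in this toy matrix)
```

Repeating the step (merge the next-highest pair into an existing group or start a new one) yields the hydrophobic group (V, L, I) and the basic group (K, R) even in this tiny example.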