Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Correspondence analysis applied to microarray data • Kurt Fellenberg C. Hausernedikt Brorsrt Neutzner', Jo( rg D. Hoheiselartin Vingron • http://www.dkfz-heidelberg.de/funct_genome/PDF-Files/PNAS-98(2001)-10781.pdf • www.pnas.org/cgi/doi/10.1073/pnas.181597298 Principal Component Analysis • Given N data vectors from k-dimensions, find c <= k orthogonal vectors that can be best used to represent data – The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions) • Each data vector is a linear combination of the c principal component vectors • Project on the subspace which preserve the most of the data variability: 1 n I d ( xi , x ) n 1 i 1 2 2 d d ( xi , x ) ( xij x j ) 2 2 j 1 • Correspondence analysis CA= PCA for categorical variables Example:Dataset X -27 dog species 7 categorical variables Name Height Weight Speed Intelligence Affection Agresivity Function - + ++ - + ++ - + ++ - + ++ - + - + C H U (company,Hunt,Utility) 1. Boxer 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 … 27.Caniche 1 0 0 1 0 0 0 1 0 =X 001 0 1 1 0 1 0 0 CA study dependence between 2 categorical variables -Height,Function Works on crossable N Height/Function C + ++ H U 6 1 0| 7 3 2 0| 5 1 6 8| 15 ----------10 9 8| 27 C H U marginals n11 n12 n13 |n1+ N= n21 n22 n23 |n2+ n31 n32 n33 |n3+ --------------n+1 n+2 n+3 |n + ++ marginals Correspondence analysis • Divide each line by its total Height/Function C H U 6/7 1/7 0/7| 7 + 3/5 2/5 0/5| 5 ++ 1/5 6/5 8/5| 15 C H U n11/n1+ n12/n1+ n13/n1+ n21/n2+ n22/n2+ n23/n2+ n31/n3+ n32/n3+ n33/n3+ Each row become a point in probability space over the categories of the second variable (conditional distributions given the category value of the first variable) # points =#categories of first variable =m dimension of space = #number of categories of second variable l distance between points (probabilities) –weighted Euclidian distance –low when indep (can transform data and work with usual Euclidian distance) nih n jh 2 ( ) l n n j d (i, j ) 2 i n h h 1 n Correspondence analysis • Each row I considered with weight= ni n • Measure for total variability I= chi-square statistic =measure of dependence between the two variables I 2 m i 1 2 ni d (i, g ) n I2 1 n m l i 1 h 1 m l i 1 h 1 ni n n n ( ih h ) 2 n n h ni n ni n h 2 ) 1 2 n ni n j n n ( nih CA –visualization of the cell that contribute most to dependence: if an n_ih has an outstanding value then both line i and column h will be far from g in the same direction Correspondence analysis Dimension reduction –project (in norm chi2) on the subspace that preserves the most of the variability (dependence) New variable =linear combinations of the initial ones Like in PCA -solutions in term of eigenvalues/eigenvectors of N -eigenvalue –gives proportion of variability preserved -measures for how well each point is represented in the subspace -measures for contribution of each point/category in determining the optimal subspace subspace “meaning” Height/Function C H U CA1 CA2 C H U 6/7 1/7 0/7| 7 1.10 -.92 n11/n1+ n12/n1+ n13/n1+ + 3/5 2/5 0/5| 5 0.85 1.23 n21/n2+ n22/n2+ n23/n2+ ++ 1/15 6/15 8/15| 15 - 0.84 0.02 n31/n3+ n32/n3+ n33/n3+ Close CA1 (and good points representation in subspace) means similar category (ease to visualize-identify similar categories of the first variable (Height) in the low dimensional plot) (-+ height have similar function) If join two “identical categories” the chi2- distance do not change Repeat everything for transpose(N) Height/Function + ++ C H 6 1 3 2 1 6 10 9 U 0| 0| 8| 8 Function/Height + ++ CA1 CA2 C 6/10 3/10 1/10 1.04 -.10 H 1/9 2/9 6/9 -0.32 .43 U 0/8 0/8 0/9 -0.94 -.37 Each column become a point in probability space over the categories of the first variable (conditional distributions given the category value of the second variable) # points =#categories of second variable =l dimension of space = #number of categories of first variable m CA- New variables =linear combinations of the initial ones preserving dependence best Close CA1 (and good points representation in subspace) means similar category (ease to visualizeidentify similar categories of the second variable (Function) in the low dimensional plot) (U,H functions have similar heights) Overlap the two plots Function/Height + ++ CA1 CA2 C 6/10 3/10 1/10 1.04 -.10 H 1/9 2/9 6/9 -0.32 .43 U 0/8 0/8 0/9 -0.94 -.37 CA1 1.10 .85 -.84 CA2 -.92 1.93 0.02 CA value in one plot are (up to a scale) weighted means of CA values in the second plot with weight corresponding to the conditional probability: 1.04= (6/10*(-.92)+3/10*1.93+1/10*0.02)*constant Include “standard coordinates” =virtual rows concentrated on one column (1 0 0) (0 1 0) (0 0 1) Categories of different variable close to the extreme of the axes and to each other are highly correlated: Utility dogs are big; Company dogs are small (see also shaving gene classification) “close to the extremes of the axes” If reorder the rows and columns by first CA – generally cells with high values go on diagonal Height/Function C H U CA1 6 1 0| 1.10 + 3 2 0| .85 ++ 1 6 8| -.84 CA1 1.04 -.32 -.94 Extension • Treat X as N (crossable for two possible variable with 27 respective 6 categories Name Height Function - + ++ C H U 1. Boxer 0 1 0 1 0 0 … =X 27.Caniche 1 0 0 1 0 0 - + ++ C H 0 1/2 0 ½ 0 U CA1 CA2 0 .45 .88 ½ 0 0 ½ 0 0 .91 .02 CA1 1.2 .85 -.84 1.04 -.32 -.4 Plot from transpose(X) identical to overlapped plots above New plot from X – extra points for each dog race Relationship Height/Function- Dog race: Canish is small dog for company Multiple correspondence analysis Use the whole X (all the variables) as crosstable Name Height Weight - + ++ - + ++ 1 . Boxer 0 1 0 0 1 0 0 Speed Intelligence Affection Agresivity Function + ++ + ++ - + - + C H U 1 0 0 1 0 0 1 0 1 1 0 0 =X … 27.Caniche 1 0 0 CA1 CA2 10 0 0 1 0 0 0 1 0 1 .32 .60 -.89 -35 .37 -.34 -.84 .78 -1.04 .89 .37 -.81 .29 .46 -.29 .27 Discovering association rules (based on correlation): Company dogs are small, with high affectivity Utility dogs are big, fast, aggressive Hunt dogs are very intelligent Use them for classification 1 0 .40 -.43 .19 -.21 1 0 0 What Is Association Mining? • Association rule mining: – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. – Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93] • Motivation: finding regularities in data – What products were often purchased together? — Beer and diapers?! – What are the subsequent purchases after buying a PC? – What kinds of DNA are sensitive to this new drug? – Can we automatically classify web documents? Basic Concepts: Frequent Patterns and Association Rules Transaction-id Items bought 10 A, B, C 20 A, C 30 A, D 40 B, E, F Customer buys both Customer buys beer Customer buys diaper Itemset X={x1, …, xk} Find all the rules XY with min confidence and support support, s, probability that a transaction contains XY confidence, c, conditional probability that a transaction having X also contains Y. Let min_support = 50%, min_conf = 50%: A C (50%, 66.7%) C A (50%, 100%) Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases • • • • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! Method: – generate length (k+1) candidate itemsets from length k frequent itemsets, – test the candidates against DB Challenges – Multiple scans of transaction database – Huge number of candidates – Tedious workload of support counting for candidates Construct FP-tree From A Transaction Database – For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree – Repeat the process on each newly created conditional FP-tree – Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern Association-Based Classification • Several methods for association-based classification – ARCS: Quantitative association mining and clustering of association rules (Lent et al’97) • It beats C4.5 in (mainly) scalability and also accuracy – Associative classification: (Liu et al’98) • It mines high support and high confidence rules in the form of “cond_set => y”, where y is a class label – CAEP (Classification by aggregating emerging patterns) (Dong et al’99) • Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another • Mine Eps based on minimum support and growth rate Table 1. Cell-cycle data as used in analysis Table 1. Cell-cycle data as used in analysis •The raw intensity data as obtained from ais imaging software (Imaging Research, St. Catherines, ON, Canada) were normalized •The normalized data matrix was filtered for genes with positive minmax separation for at least one of the conditions under study (2). •The data were shifted to a positive range by adding the minimum + 1 alpha0 alpha7 alpha14 alpha21 alpha28 alph1a35 … (M/G1) (M/G1) (G1) (G1) (S) (S) .. YHR126C 5.81 5.73 6.01 5.48 5.37 5.23 … YOR066W 5.62 5.81 6.02 5.28 5.02 5.23 … hxt4 5.78 6.21 6.02 5.5 5.58 5.21 … PCL9 4.64 5.39 4.89 5.19 4.96 5.62 … mcm3 5.38 5.8 6.13 5.74 4.52 5.22 … . . . . . . . . … * 800 genes X 73 hybridizations 4 cell-cycle arrest methods of hybridization (18-alpha,24-cdc15,17-cdc28,14-elu) Samples from each method are drawn and their cell-cycle phase had been classified –5 classes * link toward database with information (meaning, functionality etc) for each gene provided Each cell-cycle phase colored differently (M/G1),(G1),(S), (G2),(M) -can see that hybridization separate according to their cell-cycle phase (one phase = one region of the plot) - G1 phase strongly associated with histone gene cluster - cdc15-30 hybridization classified yellow behave green (located in green region) -cdc15-70 -cdc15-80 suggest improper phase classification for these samples (check with the profiles –proves correct)