Mining Phenotype Structures
Chun Tang and Aidong Zhang
Bioinformatics, 20(6):829-838, 2004
University at Buffalo The State University of New York
Microarray Data Analysis
Analysis from two angles
sample as object, gene as attribute
gene as object, sample/condition as attribute
Supervised Analysis
Select training samples (hold out…)
Sort genes (t-test, ranking…)
Select informative genes (top 50 ~ 200)
Cluster based on informative genes
        Class 1    Class 2
g1      1 1 … 1    0 0 … 0
g2      1 1 … 1    0 0 … 0
⋮
g4131   0 0 … 0    1 1 … 1
g4132   0 0 … 0    1 1 … 1
Unsupervised Analysis
We will focus on unsupervised sample partition, which assumes no phenotype information is assigned to any sample.
Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns represents a significant contribution to microarray data analysis.
Many mature statistical methods cannot be applied without the phenotypes of the samples being known in advance.
Unsupervised Automatic Phenotype Structure Mining

[Figure: samples 1 2 3 | 4 5 6 7 8 9 10; gene1–gene4 are informative genes, gene5–gene7 are noninformative genes.]
An informative gene is a gene that manifests the samples' phenotype distinction.
Phenotype structure: sample partition + informative genes.
Automatic Phenotype Structure Mining
[Figure: gene expression matrix → mining → result: a phenotype distinction over the samples ({1 2 3} vs. {4 5 6 7}) and the informative genes (gene1, gene2, gene3).]
Given an n × m data matrix M and the number K of sample phenotypes, the goal is to find K mutually exclusive groups of the samples matching their empirical phenotypes, and to find the set of informative genes which manifests this phenotype distinction.
Requirements
The expression levels of each informative
gene should be similar over the samples
within each phenotype
The expression levels of each informative
gene should display a clear dissimilarity
between each pair of phenotypes
Challenges (1)
The number of genes is very large while the number of samples is very limited, so no distinct class structure of the samples can be properly detected by the existing techniques.
Challenges (2)
[Figure: a list of genes gene1–gene15, of which only a few are informative.]
The limited informative genes are buried in a large amount of noise.
Challenges (3)
Gene LTC4 synthase U50136
Gene Fumarylacetoacetate M55150
Gene C-myb U22376
Gene PROTEASOME IOTA X59417
The values within the data matrices are all real numbers.
None of the informative genes follows the ideal “high–low” pattern.
Related Work
New tools using traditional methods:
Tools: TreeView, CLUTO, CIT, CNIO, GeneSpring, J-Express, CLUSFAVOR
Methods:
• SOM
• K-means
• Hierarchical clustering
• Graph-based clustering
• PCA
The similarity measures used in these methods are based on the
full gene space.
PCs do not necessarily have strong correlation with informative
genes.
Related Work (Cont’d)
Clustering with feature selection:
(CLIFF, two-way ordering, SamCluster)
1. Filter the invariant genes
• Rank variance
• PCA
• CV
2. Partition the samples
• Ncut, Min-Max Cut
• Hierarchical clustering
3. Prune genes based on the partition
• Markov blanket filter
• T-test
Related Work (Cont’d)
Subspace clustering:
• Bi-clustering
• δ-clustering
Related Work (Cont’d)
Subspace clustering only measures trend similarity. In our model, however, we require each gene to show consistent signals on the samples of the same phenotype.
Related Work (Cont’d)
Subspace clustering algorithms only detect locally correlated features and objects, without considering the dissimilarity between different clusters. We want the genes which can differentiate all phenotypes.
Our Contributions
We transformed the phenotype structure mining problem into an optimization problem.
A series of statistics-based metrics are defined as objective functions.
A heuristic search method and a mutual reinforcing adjustment approach are proposed to find phenotype structures.
Model - Measurements
[Figure: samples split into groups S1 and S2; genes gene1–gene3 form the set G'. Intra-consistency is measured within S1 and within S2; inter-divergence is measured between S1 and S2.]
Phenotype Quality Measurement
Intra-consistency

Measurement   Data(A), NOT consistent   Data(B), consistent
residue       0.1975                    0.4506
MSR           0.0494                    0.4012
Ours          339.0667                  5.3000
Intra-pattern-consistency (Cont’d)
For a subset of genes (candidate informative genes), does every gene have good consistency over a set of samples?

Variance of a single gene on the samples within one phenotype:

  Var(i, S') = (1 / (|S'| - 1)) * Σ_{j ∈ S'} (w_{i,j} - w̄_{i,S'})²

Intra-pattern-consistency: average row variance

  Con(G', S') = (1 / |G'|) * Σ_{gi ∈ G'} Var(i, S')
              = (1 / (|G'| (|S'| - 1))) * Σ_{gi ∈ G'} Σ_{j ∈ S'} (w_{i,j} - w̄_{i,S'})²

Average variance over the subset of genes – the smaller this value, the better the intra-phenotype consistency.
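As a minimal sketch (our reading of the definition above, not the authors' code; the toy matrix is invented for illustration), the intra-pattern-consistency can be computed with NumPy:

```python
import numpy as np

def con(M, genes, samples):
    """Intra-pattern-consistency Con(G', S'): the average row variance,
    using the 1/(|S'|-1) normalization (ddof=1) from the definition."""
    sub = M[np.ix_(genes, samples)]        # |G'| x |S'| submatrix
    return sub.var(axis=1, ddof=1).mean()  # mean of per-gene variances

# Toy expression matrix (invented): rows = genes, columns = samples.
M = np.array([[1.0, 1.1, 0.9, 5.0, 5.2],
              [2.0, 2.1, 1.9, 7.0, 6.8]])
print(con(M, [0, 1], [0, 1, 2]))  # small value: consistent within the group
print(con(M, [0, 1], [0, 1, 3]))  # large value: the group mixes two phenotypes
```

A sample group that mixes the two expression regimes inflates the per-gene variance, which is exactly what the measure penalizes.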
Inter-pattern-divergence
How well can a subset of genes (candidate informative genes) discriminate two phenotypes of samples?

Both the intra-pattern-consistency and the inter-pattern-divergence of the same gene are reflected.

Average block distance:

  Div(G', S1, S2) = (1 / |G'|) * Σ_{gi ∈ G'} |w̄_{i,S1} - w̄_{i,S2}|

Average difference of the gene means between the two phenotypes – the larger the inter-phenotype divergence, the better.
Pattern Quality
The purpose of pattern discovery is to identify the empirical patterns where the intra-pattern-consistency inside each phenotype is high and the inter-pattern-divergence between each pair of phenotypes is large.

  Ω(G', {S1, ..., SK}) = Σ_{Si, Sj (1 ≤ i < j ≤ K)} Div(G', Si, Sj) / (Con(G', Si) + Con(G', Sj))

The higher the value, the better the quality.
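The three measures fit together as in the following sketch (our reconstruction of the slide formulas, with an invented toy matrix; the sum runs over unordered pairs of sample groups):

```python
import numpy as np

def con(M, genes, samples):
    """Con(G', S'): average row variance (1/(|S'|-1) normalization)."""
    return M[np.ix_(genes, samples)].var(axis=1, ddof=1).mean()

def div(M, genes, s1, s2):
    """Div(G', S1, S2): mean absolute difference of per-gene group means."""
    m1 = M[np.ix_(genes, s1)].mean(axis=1)
    m2 = M[np.ix_(genes, s2)].mean(axis=1)
    return np.abs(m1 - m2).mean()

def quality(M, genes, groups):
    """Phenotype quality: sum over group pairs of Div / (Con + Con)."""
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            total += div(M, genes, groups[i], groups[j]) / (
                con(M, genes, groups[i]) + con(M, genes, groups[j]))
    return total

# Toy matrix (invented): two genes, two clear sample groups.
M = np.array([[1.0, 1.2, 0.8, 5.0, 5.1, 4.9],
              [2.0, 2.1, 1.9, 7.0, 7.2, 6.8]])
print(quality(M, [0, 1], [[0, 1, 2], [3, 4, 5]]))  # higher = clearer structure
```

A partition that matches the phenotype structure yields small Con values and a large Div, so Ω is large; misassigning a sample drives Con up and Ω down.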
Measurements
Intra-consistency:

  Con(G', S') = (1 / (|G'| (|S'| - 1))) * Σ_{gi ∈ G'} Σ_{j ∈ S'} (w_{i,j} - w̄_{i,S'})²

Inter-divergence:

  Div(G', S1, S2) = (1 / |G'|) * Σ_{gi ∈ G'} |w̄_{i,S1} - w̄_{i,S2}|

Phenotype quality:

  Ω(G', {S1, ..., SK}) = Σ_{Si, Sj (1 ≤ i < j ≤ K)} Div(G', Si, Sj) / (Con(G', Si) + Con(G', Sj))
Phenotype Quality
      Data(A)   Data(B)   Data(C)
Con   4.25      3.44      4.52
Div   41.60     25.20     46.16
Ω     14.2687   9.6074    15.3526

Data(C) has the highest phenotype quality.
Model - Formalized Problem
Input
m samples and n genes
the corresponding gene expression matrix M
the number of phenotypes K
Output
A K-partition of the samples (phenotypes) and a subset of genes (the informative space) such that the phenotype quality is maximized.
Strategy
Maintain a candidate phenotype structure and iteratively adjust
the candidate structure toward the optimal solution.
Basic elements:
A candidate structure:
A partition of samples {S1, S2, …, SK}
A subset of genes G' ⊆ G
The corresponding phenotype quality Ω
An adjustment:
For a gene gi ∉ G', insert it into G'
For a gene gi ∈ G', remove it from G'
For a sample si in a group S', move it to another group
The quality gain measures the change of phenotype quality before and after the adjustment.
Heuristic Searching
[Flowchart: candidate structure generation → iterative adjusting: pick up an object, try a gene/sample adjustment on the intermediate candidate structure; if ΔΩ > 0, adjust; otherwise adjust with probability p = exp(ΔΩ / T(i)).]
Heuristic Searching
Start with a random K-partition of samples and a subset of genes as the candidate informative space.
Iteratively adjust the partition and the gene set toward a better solution (random order of genes and samples):
for each gene, try a possible insert/remove;
for each sample, try the best movement.
[Figure: the three adjustment types – insert a gene, remove a gene, move a sample.]
Heuristic Search
For each possible adjustment, compute the quality gain ΔΩ
For each gene, try possible insert/remove
For each sample, try the best movement
ΔΩ > 0: conduct the adjustment
ΔΩ < 0: conduct the adjustment with probability

  p = exp(ΔΩ / T(i))

where T(i) is a decreasing simulated-annealing function and i is the iteration number; T(0) = 1, T(i) = 1/(i + 1) in our implementation.
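The acceptance rule can be sketched as follows (`accept` is a hypothetical helper, not the authors' code; the quality gain ΔΩ is assumed to come from the model above):

```python
import math
import random

def T(i):
    """Cooling schedule from the slides: T(0) = 1, T(i) = 1/(i + 1)."""
    return 1.0 / (i + 1)

def accept(delta_omega, i, rng=random.random):
    """Accept an adjustment: always if the quality gain is positive,
    otherwise with probability p = exp(delta_omega / T(i))."""
    if delta_omega > 0:
        return True
    return rng() < math.exp(delta_omega / T(i))

# The same negative gain is accepted less and less often as i grows.
print(math.exp(-0.5 / T(0)))  # acceptance probability at iteration 0
print(math.exp(-0.5 / T(9)))  # much smaller probability at iteration 9
```

Accepting occasional quality-decreasing moves early on lets the search escape local maxima, while the shrinking temperature makes late iterations effectively greedy.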
Mutual Reinforcing Adjustment - Motivation
Drawbacks of the heuristic search method: blind initialization, equal chance given to samples and genes, noisy samples.
The phenotype quality of a subset of the informative genes and a partial phenotype structure should also be high.
Mining phenotypes and informative genes directly from high-dimensional noisy data is difficult, so we start from small groups whose data distributions and patterns are much easier to detect.
Mining of phenotypes and mining of informative genes should mutually reinforce each other.
Mutual Reinforcing Adjustment - Motivation
[Figure: sample distributions at stages A, B, and C of the adjustment.]
Mutual Reinforcing Adjustment - Major Steps
Partition the Matrix: divide the original matrix into a
series of exclusive sub-matrices based on partitioning
both the samples and genes.
Reference Partition Detection: post a partial or
approximate phenotype structure called a reference
partition of samples.
compute the reference degree for each sample group;
select K groups of samples;
do partition adjustment.
Gene Adjustment: adjust the candidate informative genes.
compute Ω for the reference partition on G';
perform possible adjustments of each gene.
Refinement Phase
Method Detail - Iteration Phase
[Flowchart of one iteration: partitioning the matrix over all samples and the informative genes G' → reference partition detection → gene adjustment (yielding updated informative genes G'') → to the next iteration.]
Partitioning the Matrix
Partition the samples and genes into multiple groups
Use CAST
A threshold t decides the size of each group
Based on Pearson's correlation coefficient:

  ρ_{X,Y} = Σ_{i=1}^{k} (x_i - x̄)(y_i - ȳ) / √( Σ_{i=1}^{k} (x_i - x̄)² · Σ_{i=1}^{k} (y_i - ȳ)² )

Outliers will be filtered out of any group
Samples or genes in the same group share similar patterns
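For reference, the similarity measure can be sketched directly from the formula (a standalone sketch, not tied to any particular CAST implementation; the profiles are invented):

```python
import numpy as np

def pearson(x, y):
    """Pearson's correlation coefficient between two expression profiles."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0: identical trend
print(pearson([1, 2, 3], [6, 4, 2]))  # -1.0: opposite trend
```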
Reference Partition Detection
Select the groups of samples as potential
phenotypes
Pick the first group with the highest reference degree:

  ref(S_j) = log|S_j| · Σ_{Gi ∈ G'} 1 / Con(Gi, S_j)

Select the other groups by considering the inter-phenotype divergence w.r.t. the already-selected groups:

  Ran(S_px) = log|S_px| · Σ_{Gi ∈ G'} ( Σ_{t=0}^{x-1} Div(Gi, S_px, S_pt) ) / Con(Gi, S_px)
Check the Missing Samples
Probabilistically insert the remaining samples not in the selected groups into the most probable matching group
In iterations, use the gene candidate sets to
improve the reference partition
Gene Adjustment
Gene adjustment: Test the possible adjustments that
lead to improvement
Insert a gene
Remove a gene
Method-Refinement Phase
The partition corresponding to the best state may not
cover all the samples.
Add every sample not covered by the reference partition into its matching group → the phenotypes of the samples.
Then, a gene adjustment phase is conducted. We execute all adjustments with a positive quality gain → the informative space.
Time complexity: O(n·m²·I)
Mining Multiple Phenotype Structures
[Figure: samples 1–10 over genes gene1–gene9; one subset of genes manifests the empirical phenotype structure, while another subset manifests a hidden phenotype structure.]
Output: p phenotype structures where the tth structure is a Kt-partition of
samples (phenotypes) and a subset of genes (informative space) which
manifest the sample partition. The overall phenotype quality is maximized.
Extended Algorithm Strategy
Maintain p candidate phenotype structures and iteratively adjust
them toward the optimal solution.
Basic elements of each candidate structure:
A Kt-partition of samples
A subset of genes Gt ⊆ G
The corresponding phenotype quality Ωt
An adjustment:
For a gene gi ∉ Gt, insert it into Gt
For a gene gi ∈ Gt', move it to Gt (t ≠ t') or remove it from all structures
For a sample si in group S', move it to another group
The quality gain measures the change of pattern quality of the states before and after the adjustment.
The Extended Algorithm (Cont’d)
[Figure: gene adjustments (insert, move, remove) and sample moves across candidate structure 1 and candidate structure 2.]
Mining Multiple Phenotype Structures
(Cont’d)
Partially informative genes
Formalized Problem
Input
•m samples and n genes
•the corresponding gene expression matrix M
•the number of phenotype structures p
•the set of numbers {K1, K2, …, Kp}
Output
p phenotype structures where the tth structure is a
Kt-partition of samples (phenotypes) and a subset
of genes (informative space) which manifest the
sample partition. The overall phenotype quality is
maximized.
The Algorithm
Candidate Structure Generation
cluster genes into p' groups (p' > p) (CAST)
generate sample partitions one by one on the clusters of genes; select the best-quality genes.
Iterative Adjustment
for each gene, try possible insert/move/remove
for each sample:
- examine all possible adjustments
- select the best movement.
The Algorithm (Cont’d)
Gene: p possible adjustments (insert, move, remove)
Sample: Kt - 1 possible adjustments for each partition
The Algorithm (Cont’d)
Data Standardization
the original gene intensity values → relative values:

  w'_{i,j} = (w_{i,j} - w̄_i) / σ_i , where w̄_i = (1/m) Σ_{j=1}^{m} w_{i,j} and σ_i = √( Σ_{j=1}^{m} (w_{i,j} - w̄_i)² / (m - 1) )

Random order of genes and samples
Conduct a negative action with a probability (simulated-annealing technique):

  p = exp(ΔΩ / T(i)) , with T(0) = 1 and T(i) = 1 / (1 + i).
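The per-gene z-score transformation can be sketched as follows (a minimal sketch assuming rows are genes and columns are samples; the matrix is invented):

```python
import numpy as np

def standardize(M):
    """Per-gene z-score: w'_ij = (w_ij - mean_i) / sigma_i, using the
    (m-1)-denominator sample standard deviation (ddof=1) per row."""
    mean = M.mean(axis=1, keepdims=True)
    std = M.std(axis=1, ddof=1, keepdims=True)
    return (M - mean) / std

# Toy matrix (invented): two genes on very different intensity scales.
M = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
W = standardize(M)
print(W)  # each row now has mean 0 and unit sample variance
```

After standardization the two genes become directly comparable, which is why the measures above can be averaged across genes.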
Experiments
Data Sets:
Multiple-sclerosis data
MS-IFN : 4132 * 28 (14 MS vs. 14 IFN)
MS-CON : 4132 * 30 (15 MS vs. 15 Control)
Leukemia data
7129 * 38 (27 ALL vs. 11 AML)
7129 * 34 (20 ALL vs. 14 AML)
Colon cancer data
2000 * 62 (22 normal vs. 40 tumor colon tissue)
Hereditary breast cancer data
3226 * 22 ( 7 BRCA1, 8 BRCA2, 7 Sporadics)
Rand Index
Rand Index -A measurement of “agreement”
between the ground-truth (P) and the results (Q) :
“a”: the number of pairs of objects that are in the same class in P and in the same class in Q;
“b”: the number of pairs of objects that are in the same class in P but not in the same class in Q;
“c”: the number of pairs of objects that are in the same class in Q but not in the same class in P;
“d”: the number of pairs of objects that are in different classes in P and in different classes in Q.

  RI = (a + d) / (a + b + c + d)
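The Rand index can be computed by direct pair counting, as in this sketch (toy labelings invented for illustration):

```python
from itertools import combinations

def rand_index(P, Q):
    """Rand index between two labelings of the same objects:
    RI = (a + d) / (a + b + c + d) over all object pairs."""
    a = b = c = d = 0
    for i, j in combinations(range(len(P)), 2):
        same_p, same_q = P[i] == P[j], Q[i] == Q[j]
        if same_p and same_q:
            a += 1        # together in both P and Q
        elif same_p:
            b += 1        # together in P only
        elif same_q:
            c += 1        # together in Q only
        else:
            d += 1        # apart in both
    return (a + d) / (a + b + c + d)

print(rand_index([1, 1, 2, 2], [1, 1, 2, 2]))  # 1.0: perfect agreement
print(rand_index([1, 1, 2, 2], [1, 2, 1, 2]))  # lower: partial agreement
```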
[Figure: example partitions P and Q over samples s1 and s2, illustrating the pair counts a, b, c, d.]
Phenotype Structure Detection
Data Set     MS-IFN    MS-CON    Leukemia-G1   Leukemia-G2   Colon     Breast
Data Size    4132*28   4132*30   7129*38       7129*34       2000*62   3226*22
J-Express    0.4815    0.4851    0.5092        0.4965        0.4939    0.4112
CLUTO        0.4815    0.4828    0.5775        0.4866        0.4966    0.6364
CIT          0.4841    0.4851    0.6586        0.4920        0.4966    0.5844
CNIO         0.4815    0.4920    0.6017        0.4920        0.4939    0.4112
CLUSFAVOR    0.5238    0.5402    0.5092        0.4920        0.4939    0.5844
δ-cluster    0.4894    0.4851    0.5007        0.4538        0.4796    0.4719
Heuristic    0.8052    0.6230    0.9761        0.7086        0.6293    0.8638
Mutual       0.8387    0.6513    0.9778        0.7558        0.6827    0.8749
Experiments
Data Size   Iterations (mean / std dev)   Running time (mean / std dev)
4132*28     158 / 27.2                    180 / 35.1
4132*30     168 / 29.5                    195 / 37.8
7129*38     171 / 16.1                    436 / 51.9
7129*34     198 / 35.9                    458 / 101.2
2000*62     133 / 17.8                    479 / 98.5
3226*22     157 / 22.2                    167 / 35.6

The mean value and standard deviation of the number of iterations and response time (in seconds) with respect to the matrix size.
Phenotype Structure Detection (Cont'd)
Experimental Results
The mutual reinforcing approach as applied to the MS-IFN group.
(A) shows the distribution of the original 28 samples. Each point represents a sample with 4132 genes mapped to two-dimensional space.
(B) shows the distribution in the middle of the adjustment.
(C) shows the distribution of the same 28 samples after the iterations. 76 genes were selected as the informative space.
Informative Gene Selection - Experimental Results
Phenotype Structures
Informative Gene Selection - Experimental Results (Cont'd)
Scalability Evaluation
Experimental Results
Conclusion from the Experiments
The work is motivated by the needs of emerging
microarray data analysis.
The strategy is designed for data which have the
following properties:
The number of samples is limited but the gene
dimension is very large.
Large volumes of irrelevant and redundant genes
prevent accurate grouping of samples;
Analyzing objects along one dimension can enhance the detection of meaningful patterns in the other dimension.